You are on page 1of 32

Entity-relationship model

A sample Entity-relationship diagram using Chen's notation In software engineering, an entity-relationship model (ER model for short) is an abstract and conceptual representation of data. Entity-relationship modeling is a database modeling method, used to produce a type of conceptual schema or semantic data model of a system, often a relational database, and its requirements in a top-down fashion. Diagrams created by this process are called entity-relationship diagrams or ER diagrams. Overview The first stage of information system design uses these models during the requirements analysis to describe information needs or the type of information that is to be stored in a database. The data modeling technique can be used to describe any ontology (i.e. an overview and classifications of used terms and their relationships) for a certain area of interest. In the case of the design of an information system that is based on a database, the conceptual data model is, at a later stage (usually called logical design), mapped to a logical data model, such as the relational model; this in turn is mapped to a physical model during physical design. Note that, sometimes, both of these phases are referred to as "physical design". A UML metamodel of Extended Entity Relationship models The building blocks: entities, relationships, and attributes

Two related entities

An entity with an attribute

A relationship with an attribute Primary key An entity may be defined as a thing which is recognized as being capable of an independent existence and which can be uniquely identified. An entity is an abstraction from the complexities of some domain. When we speak of an entity we normally speak of some aspect of the real world which can be distinguished from other aspects of the real world. An entity may be a physical object such as a house or a car, an event such as a house sale or a car service, or a concept such as a customer transaction or order. Although the term entity is the one most commonly used, following Chen we should really distinguish between an entity and an entity-type. An entity-type is a category. An entity, strictly speaking, is an instance of a given entity-type. There are usually many instances of an entity-type. Because the term entity-type is somewhat cumbersome, most people tend to use the term entity as a synonym for this term. Entities can be thought of as nouns. Examples: a computer, an employee, a song, a mathematical theorem.

A relationship captures how entities are related to one another. Relationships can be thought of as verbs, linking two or more nouns. Examples: an owns relationship between a company and a computer, a supervises relationship between an employee and a department, a performs relationship between an artist and a song, a proved relationship between a mathematician and a theorem. The model's linguistic aspect described above is utilized in the declarative database query language ERROL, which mimics natural language, constructs. ERROL's semantics and implementation are based on Reshaped relational algebra (RRA), a relational algebra which is adapted to the entity-relationship model and captures its linguistic aspect. Entities and relationships can both have attributes. Examples: an employee entity might have a Social Security Number (SSN) attribute; the proved relationship may have a date attribute. Every entity (unless it is a weak entity) must have a minimal set of uniquely identifying attributes, which is called the entity's primary key. Entity-relationship diagrams don't show single entities or single instances of relations. Rather, they show entity sets and relationship sets. Example: a particular song is an entity. The collection of all songs in a database is an entity set. The eaten relationship between a child and her lunch is a single relationship. The set of all such child-lunch relationships in a database is a relationship set. In other words, a relationship set corresponds to a relation in mathematics, while a relationship corresponds to a member of the relation. Certain cardinality constraints on relationship sets may be indicated as well. Relationships, roles and cardinalities In Chen's original paper he gives an example of a relationship and its roles. He describes a relationship "marriage" and its two roles "husband" and "wife". A person plays the role of husband in a marriage (relationship) and another person plays the role of wife in the (same) marriage. These words are nouns. That is no surprise, naming things requires a noun. However as is quite usual with new ideas, many eagerly appropriated the new terminology but then applied it to their own old ideas. Thus the lines, arrows and crows-feet of their diagrams owed more to the earlier Bachman diagrams than to Chen's relationship diamonds. And they similarly misunderstood other important concepts. In particular, it became fashionable (now almost to the point of exclusivity) to "name" relationships and roles as verbs or phrases. Relationship names A relationship expressed with a single verb implying direction, makes it impossible to discuss the model using the following proper English. For Example: the song and the performer are related by a 'performs' the husband and wife are related by an 'is-married-to'. Expressing the relationships with a noun resolves this: the song and the performer are related by a 'performance' the husband and wife are related by a 'marriage'. Traditionally, the relationships are expressed twice, (using present continuous verb phrases), once in each direction. This gives two English statements per relationship. For example: the song is performed by the performer the performer performs the song Role naming It has also become prevalent to name roles with phrases e.g. is-the-owner-of and is-owned-by etc. Correct nouns in this case are "owner" and "possession". Thus "person plays the role of owner" and "car plays the role of possession" rather than "person plays the role of is-the-owner-of" etc. The use of nouns has direct benefit when generating physical implementations from semantic models. When a person has two relationships with car then it is possible to very simply generate names such as "owner_person" and "driver_person" which are immediately meaningful. Cardinalities However some modifications to the original specification are beneficial. Chen described look-across cardinalities. UML perpetuates this. (As an aside, the Barker-Ellis notation, used in Oracle Designer, uses same-side for minimum cardinality (analogous to optionality) and role, but look-across for maximum cardinality (the crows foot)). Other authors (Merise, Elmasri & Navathe amongst others) prefer same-side for roles and both minimum and maximum cardinalities. Recent researchers (Feinerer, Dullea et. alia) have shown that this is more coherent when applied to n-ary relationships of order >2. In Dullea et. alia "An analysis of structural validity in entity-relationship modeling" one reads "A 'look across' notation such as used in the UML does not effectively represent the semantics of participation constraints imposed on relationships where the degree is higher than binary." In Feinerer it says "Problems arise if we operate under the look-across semantics as used for UML associations. Hartmann investigates this situation and shows how and why different transformations fail." (Although the "reduction" mentioned is spurious as the two diagrams 3.4 and 3.5 are in fact the same) and also "As we will see on the next few pages, the lookacross interpretation introduces several difficulties which prevent the extension of simple mechanisms from binary to n-ary associations." Semantic modeling

The father of ER modeling said in his seminal paper: "The entity-relationship model adopts the more natural view that the real world consists of entities and relationships. It incorporates some of the important semantic information about the real world." He is here in accord with philosophic and theoretical traditions from the time of the Ancient Greek philosophers: Socrates, Plato and Aristotle (428 BC) through to modern epistemology, semiotics and logic of Peirce, Frege and Russell. Plato himself associates knowledge with the apprehension of unchanging Forms (The forms, according to Socrates, are roughly speaking archetypes or abstract representations of the many types of things, and properties) and their relationships to one another. In his original 1976 article Chen explicitly contrasts Entity-Relationship diagrams with record modeling techniques: "The data structure diagram is a representation of the organization of records and is not an exact representation of entities and relationships." Several other authors also support his program: Kent in "Data and Reality" : "One thing we ought to have clear in our minds at the outset of a modeling endeavor is whether we are intent on describing a portion of "reality" (some human enterprise) or a data processing activity." Abrial in "Data Semantics: "... the so called "logical" definition and manipulation of data are still influenced (sometimes unconsciously) by the "physical" storage and retrieval mechanisms currently available on computer systems." Stamper: "They pretend to describe entity types, but the vocabulary is from data processing: fields, data items, values. Naming rules don't reflect the conventions we use for naming people and things; they reflect instead techniques for locating records in files." In Jackson's words: "The developer begins by creating a model of the reality with which the system is concerned the reality which furnishes its [the system's] subject matter ..." Elmasri, Navathe: "The ER model concepts are designed to be closer to the users perception of data and are not meant to describe the way in which data will be stored in the computer." A semantic model is a model of concepts; it is sometimes called a "platform independent model". It is an intensional model. At the latest since Carnap, it is well known that:[11] "...the full meaning of a concept is constituted by two aspects, its intension and its extension. The first part comprises the embedding of a concept in the world of concepts as a whole, i.e. the totality of all relations to other concepts. The second part establishes the referential meaning of the concept, i.e. its counterpart in the real or in a possible world". An extensional model is that which maps to the elements of a particular methodology or technology, and is thus a "platform specific model". The UML specification explicitly states that associations in class models are extensional and this is in fact self evident by considering the extensive array of additional "adornments" provided by the specification over and above those provided by any of the prior candidate "semantic modelling languages"."UML as a Data Modeling Notation, Part 2" Diagramming conventions

Various methods of representing the same one to many relationship. In each case, the diagram shows the relationship between a person and a place of birth: each person must have been born at one, and only one, location, but each location may have had zero or more people born at it.

Two related entities shown using Crow's Foot notation. In this example, an optional relationship is shown between Artist and Song; the symbols closest to the song entity represents "zero, one, or many", whereas a song has "one and only one" Artist. The former is therefore read as, an Artist (can) perform(s) "zero, one, or many" song(s). Chen's notation for entityrelationship modeling uses rectangles to represent entities, and diamonds to represent relationships appropriate for firstclass objects: they can have attributes and relationships of their own. Entity sets are drawn as rectangles, relationship sets as diamonds. If an entity set participates in a relationship set, they are connected with a line . Attributes are drawn as ovals and are connected with a line to exactly one entity or relationship set. Cardinality constraints are expressed as follows:

a double line indicates a participation constraint, totality or surjectivity: all entities in the entity set must participate in at least one relationship in the relationship set; an arrow from entity set to relationship set indicates a key constraint, i.e. injectivity: each entity of the entity set can participate in at most one relationship in the relationship set; a thick line indicates both, i.e. bijectivity: each entity in the entity set is involved in exactly one relationship. an underlined name of an attribute indicates that it is a key: two different entities or relationships with this attribute always have different values for this attribute.

Attributes are often omitted as they can clutter up a diagram; other diagram techniques often list entity attributes within the rectangles drawn for entity sets. Related diagramming convention techniques: Bachman notation EXPRESS IDEF1X[12] Martin notation (min, max)-notation of Jean-Raymond Abrial in 1974 UML class diagrams Merise Crow's Foot Notation Crow's Foot notation is used in Barker's Notation, SSADM and Information Engineering. Crow's Foot diagrams represent entities as boxes, and relationships as lines between the boxes. Different shapes at the ends of these lines represent the cardinality of the relationship. Crow's Foot notation was used in the 1980s by the consultancy practice CACI. Many of the consultants at CACI (including Richard Barker) subsequently moved to Oracle UK, where they developed the early versions of Oracle's CASE tools, introducing the notation to a wider audience. The following tools use Crow's Foot notation: ARIS, System Architect, Visio, PowerDesigner, Toad Data Modeler, DeZign for Databases, Devgems Data Modeler, OmniGraffle, MySQL Workbench and SQL Developer Data Modeler. CA's ICASE tool, CA Gen aka Information Engineering Facility also uses this notation. ER diagramming tools There are many ER diagramming tools. Some free software ER diagramming tools that can interpret and generate ER models and SQL and do database analysis are MySQL Workbench (formerly DBDesigner), and Open ModelSphere (opensource). A freeware ER tool that can generate database and application layer code (webservices) is the RISE Editor. Some of the proprietary ER diagramming tools are Avolution, dbForge Studio for MySQL, ER/Studio, ERwin, MEGA International, ModelRight, OmniGraffle, Oracle Designer, PowerDesigner, Rational Rose, Sparx Enterprise Architect, SQLyog, System Architect, Toad Data Modeler, and Visual Paradigm. Some free software diagram tools just draw the shapes without having any knowledge of what they mean, nor do they generate SQL. These include yEd, LucidChart, Kivio, and Dia. Limitations ER models assume information content that can readily be represented in a relational database. They describe only a relational structure for this information. Hence, they are inadequate for systems in which the information cannot readily be represented in relational form, such as with semi-structured data. Furthermore, for many systems, the possible changes to the information contained are nontrivial and important enough to warrant explicit specification. Some authors have extended ER modeling with constructs to represent change, an approach supported by the original author;[13] an example is Anchor Modeling. An alternative is to model change separately, using a process modeling technique. Additional techniques can be used for other aspects of systems. For instance, ER models roughly correspond to just 1 of the 14 different modeling techniques offered by UML. Another limitation: ER modeling is aimed at specifying information from scratch. This suits the design of new, standalone information systems, but is of less help in integrating pre-existing information sources that already define their own data representations in detail. Even where it is suitable in principle, ER modeling is rarely used as a separate activity. One reason for this is today's abundance of tools to support diagramming and other design support directly on relational database management systems. These tools can readily extract database diagrams that are very close to ER diagrams from existing databases, and they provide alternative views on the information contained in such diagrams. In a survey, Brodie and Liu could not find a single instance of entity-relationship modeling inside a sample of ten Fortune 100 companies. Badia and Lemire blame this lack of use on the lack of guidance but also on the lack of benefits, such as lack of support for data integration. Data model

Overview of data modeling context: A data model provides the details of information to be stored, and is of primary use when the final product is the generation of computer software code for an application or the preparation of a functional specification to aid a computer software make-or-buy decision. The figure is an example of the interaction between process and data models. A high-level data model in business or for any functional area is an abstract model that documents and organizes the business data for communication between functional and technical people. It is used to show the data needed and created by business processes. A data model in software engineering is an abstract model that documents and organizes the business data for communication between team members and is used as a plan for developing applications, specifically how data is stored and accessed. According to Hoberman (2009), "A data model is a wayfinding tool for both business and IT professionals, which uses a set of symbols and text to precisely explain a subset of real information to improve communication within the organization and thereby lead to a more flexible and stable application environment."[2] A data model explicitly determines the structure of data or structured data. Typical applications of data models include database models, design of information systems, and enabling exchange of data. Usually data models are specified in a data modeling language. Communication and precision are the two key benefits that make a data model important to applications that use and exchange data. A data model is the medium which project team members from different backgrounds and with different levels of experience can communicate with one another. Precision means that the terms and rules on a data model can be interpreted only one way and are not ambiguous. A data model can be sometimes referred to as a data structure, especially in the context of programming languages. Data models are often complemented by function models, especially in the context of enterprise models. Overview Managing large quantities of structured and unstructured data is a primary function of information systems. Data models describe structured data for storage in data management systems such as relational databases. They typically do not describe unstructured data, such as word processing documents, email messages, pictures, digital audio, and video. The role of data models

How data models deliver benefit.[4] The main aim of data models is to support the development of information systems by providing the definition and format of data. According to West and Fowler (1999) "if this is done consistently across systems then compatibility of data can be achieved. If the same data structures are used to store and access data then different applications can share data. The results of this are indicated above. However, systems and interfaces often cost more than they should, to build, operate, and maintain. They may also constrain the business rather than support it. A major cause is that the quality of the data models implemented in systems and interfaces is poor".[4] "Business rules, specific to how things are done in a particular place, are often fixed in the structure of a data model. This means that small changes in the way business is conducted lead to large changes in computer systems and interfaces".[4] "Entity types are often not identified, or incorrectly identified. This can lead to replication of data, data structure, and functionality, together with the attendant costs of that duplication in development and maintenance".[4] "Data models for different systems are arbitrarily different. The result of this is that complex interfaces are required between systems that share data. These interfaces can account for between 25-70% of the cost of current systems".
[4]

"Data cannot be shared electronically with customers and suppliers, because the structure and meaning of data has not been standardised. For example, engineering design data and drawings for process plant are still sometimes exchanged on paper".[4]

The reason for these problems is a lack of standards that will ensure that data models will both meet business needs and be consistent.[4] According to Hoberman (2009), "A data model is a wayfinding tool for both business and IT professionals, which uses a set of symbols and text to precisely explain a subset of real information to improve communication within the organization and thereby lead to a more flexible and stable application environment."[2] A data model explicitly determines the structure of data or structured data. Typical applications of data models include database models, design of information systems, and enabling exchange of data. Usually data models are specified in a data modeling language.[3] Communication and precision are the two key benefits that make a data model important to applications that use and exchange data. A data model is the medium which project team members from different backgrounds and with different levels of experience can communicate with one another. Precision means that the terms and rules on a data model can be interpreted only one way and are not ambiguous.[2] A data model can be sometimes referred to as a data structure, especially in the context of programming languages. Data models are often complemented by function models, especially in the context of enterprise models.

Three perspectives

The ANSI/SPARC three level architecture. This shows that a data model can be an external model (or view), a conceptual model, or a physical model. This is not the only way to look at data models, but it is a useful way, particularly when comparing models. A data model instance may be one of three kinds according to ANSI in 1975: Conceptual schema : describes the semantics of a domain, being the scope of the model. For example, it may be a model of the interest area of an organization or industry. This consists of entity classes, representing kinds of things of significance in the domain, and relationships assertions about associations between pairs of entity classes. A conceptual schema specifies the kinds of facts or propositions that can be expressed using the model. In that sense, it defines the allowed expressions in an artificial 'language' with a scope that is limited by the scope of the model. The use of conceptual schema has evolved to become a powerful communication tool with business users. Often called a subject area model (SAM) or high-level data model (HDM), this model is used to communicate core data concepts, rules, and definitions to a business user as part of an overall application development or enterprise initiative. The number of objects should be very small and focused on key concepts. Try to limit this model to one page, although for extremely large organizations or complex projects, the model might span two or more pages.[6] Logical schema : describes the semantics, as represented by a particular data manipulation technology. This consists of descriptions of tables and columns, object oriented classes, and XML tags, among other things. Physical schema : describes the physical means by which data are stored. This is concerned with partitions, CPUs, tablespaces, and the like. The significance of this approach, according to ANSI, is that it allows the three perspectives to be relatively independent of each other. Storage technology can change without affecting either the logical or the conceptual model. The table/column structure can change without (necessarily) affecting the conceptual model. In each case, of course, the structures must remain consistent with the other model. The table/column structure may be different from a direct translation of the entity classes and attributes, but it must ultimately carry out the objectives of the conceptual entity class structure. Early phases of many software development projects emphasize the design of a conceptual data model. Such a design can be detailed into a logical data model. In later stages, this model may be translated into physical data model. However, it is also possible to implement a conceptual model directly. Types of data models Hierarchical Model The hierarchical data model organizes data in a tree structure. There is a hierarchy of parent and child data segments. This structure implies that a record can have repeating information, generally in the child data segments. Data is a series of records, which have a set of field values attached to it. It collects all the instances of a specific record together as a record type. These record types are the equivalent of tables in the relational model, and with the individual records being the equivalent of rows. To create links between these record types, the hierarchical model uses Parent Child Relationships. These are a 1:N mapping between record types. This is done by using trees, like set theory used in the relational model, "borrowed" from maths. For example, an organization might store information about an employee, such as name, employee number, department, salary. The organization might also store information about an employee's children, such as name and date of birth. The employee and children data forms a hierarchy, where the employee data represents the parent segment and the children data represents the child segment. If an employee has three children, then there would be three child segments associated with one employee segment. In a hierarchical database the parent-child relationship is one to many. This restricts a child segment to

having only one parent segment. Hierarchical DBMSs were popular from the late 1960s, with the introduction of IBM's Information Management System (IMS) DBMS, through the 1970s. Network Model The popularity of the network data model coincided with the popularity of the hierarchical data model. Some data were more naturally modelled with more than one parent per child. So, the network model permitted the modelling of many-to-many relationships in data. In 1971, the Conference on Data Systems Languages (CODASYL) formally defined the network model. The basic data modelling construct in the network model is the set construct. A set consists of an owner record type, a set name, and a member record type. A member record type can have that role in more than one set, hence the multiparent concept is supported. An owner record type can also be a member or owner in another set. The data model is a simple network, and link and intersection record types (called junction records by IDMS) may exist, as well as sets between them. Thus, the complete network of relationships is represented by several pair wise sets; in each set some (one) record type is owner (at the tail of the network arrow) and one or more record types are members (at the head of the relationship arrow). Usually, a set defines a 1:M relationship, although 1:1 is permitted. The CODASYL network model is based on mathematical set theory. Relational Model (RDBMS - relational database management system) A database based on the relational model developed by E.F. Codd. A relational database allows the definition of data structures, storage and retrieval operations and integrity constraints. In such a database the data and relations between them are organised in tables. A table is a collection of records and each record in a table contains the same fields. Properties of Relational Tables: Values Are Atomic Each Row is Unique Column Values Are of the Same Kind The Sequence of Columns is Insignificant The Sequence of Rows is Insignificant Each Column Has a Unique Name Certain fields may be designated as keys, which mean that searches for specific values of that field will use indexing to speed them up. Where fields in two different tables take values from the same set, a join operation can be performed to select related records in the two tables by matching values in those fields. Often, but not always, the fields will have the same name in both tables. For example, an "orders" table might contain (customer-ID, product-code) pairs and a "products" table might contain (product-code, price) pairs so to calculate a given customer's bill you would sum the prices of all products ordered by that customer by joining on the product-code fields of the two tables. This can be extended to joining multiple tables on multiple fields. Because these relationships are only specified at retrieval time, relational databases are classed as dynamic database management system. The RELATIONAL database model is based on the Relational Algebra. Object/Relational Model Object/relational database management systems (ORDBMSs) add new object storage capabilities to the relational systems at the core of modern information systems. These new facilities integrate management of traditional fielded data, complex objects such as time-series and geospatial data and diverse binary media such as audio, video, images, and applets. By encapsulating methods with data structures, an ORDBMS server can execute complex analytical and data manipulation operations to search and transform multimedia and other complex objects. As an evolutionary technology, the object/relational (OR) approach has inherited the robust transactionand performance-management features of it s relational ancestor and the flexibility of its object-oriented cousin. Database designers can work with familiar tabular structures and data definition languages (DDLs) while assimilating new object-management possibilities. Query and procedural languages and call interfaces in ORDBMSs are familiar: SQL3, vendor procedural languages, and ODBC, JDBC, and proprietary call interfaces are all extensions of RDBMS languages and interfaces. And the leading vendors are, of course, quite well known: IBM, Inform ix, and Oracle. Object-Oriented Model Object DBMSs add database functionality to object programming languages. They bring much more than persistent storage of programming language objects. Object DBMSs extend the semantics of the C++, Smalltalk and Java object programming languages to provide full-featured database programming capability, while retaining native language compatibility. A major benefit of this approach is the unification of the application and database development into a seamless data model and language environment. As a result, applications require less code, use more natural data modelling, and code bases are easier to maintain. Object developers can write complete database applications with a modest amount of additional effort. According to Rao (1994), "The object-oriented database (OODB) paradigm is the combination of objectoriented programming language (OOPL) systems and persistent systems. The power of the OODB comes from the seamless treatment of both persistent data, as found in databases, and transient data, as found in executing programs." In contrast to a relational DBMS where a complex data structure must be flattened out to fit into tables or joined together from those tables to form the in-memory structure, object DBMSs have no performance overhead to store or retrieve a web or hierarchy of interrelated objects. This one-to-one mapping of object programming language objects to database objects has two benefits over other storage approaches: it provides higher performance management of objects, and it enables better management of the complex interrelationships between objects. This makes object DBMSs better suited to support applications such as financial portfolio risk analysis systems, telecommunications service applications, World Wide Web document structures, design and manufacturing systems, and hospital patient record systems, which have complex relationships between data.

Database design Database design is the process of producing a detailed data model of a database. This logical data model contains all the needed logical and physical design choices and physical storage parameters needed to generate a design in a Data Definition Language, which can then be used to create a database. A fully attributed data model contains detailed attributes for each entity. The term database design can be used to describe many different parts of the design of an overall database system. Principally, and most correctly, it can be thought of as the logical design of the base data structures used to store the data. In the relational model these are the tables and views. In an object database the entities and relationships map directly to object classes and named relationships. However, the term database design could also be used to apply to the overall process of designing, not just the base data structures, but also the forms and queries used as part of the overall database application within the database management system (DBMS).[1] The process of doing database design generally consists of a number of steps which will be carried out by the database designer. Usually, the designer must: Determine the relationships between the different data elements. Superimpose a logical structure upon the data on the basis of these relationships.[2] ER Diagram (Entity-relationship model)

A sample Entity-relationship diagram Database designs also include ER (Entity-relationship model) diagrams. An ER diagram is a diagram that helps to design databases in an efficient way. Attributes in ER diagrams are usually modeled as an oval with the name of the attribute, linked to the entity or relationship that contains the attribute. Within the relational model the final step can generally be broken down into two further steps, that of determining the grouping of information within the system, generally determining what are the basic objects about which information is being stored, and then determining the relationships between these groups of information, or objects. This step is not necessary with an Object database.[2] The Design Process The design process consists of the following steps: 1. Determine the purpose of your database - This helps prepare you for the remaining steps. 2. Find and organize the information required - Gather all of the types of information you might want to record in the database, such as product name and order number. 3. Divide the information into tables - Divide your information items into major entities or subjects, such as Products or Orders. Each subject then becomes a table. 4. Turn information items into columns - Decide what information you want to store in each table. Each item becomes a field, and is displayed as a column in the table. For example, an Employees table might include fields such as Last Name and Hire Date. 5. Specify primary keys - Choose each tables primary key. The primary key is a column that is used to uniquely identify each row. An example might be Product ID or Order ID. 6. Set up the table relationships - Look at each table and decide how the data in one table is related to the data in other tables. Add fields to tables or create new tables to clarify the relationships, as necessary. 7. Refine your design - Analyze your design for errors. Create the tables and add a few records of sample data. See if you can get the results you want from your tables. Make adjustments to the design, as needed. 8. Apply the normalization rules - Apply the data normalization rules to see if your tables are structured correctly. Make adjustments to the tables Determining data to be stored In a majority of cases, a person who is doing the design of a database is a person with expertise in the area of database design, rather than expertise in the domain from which the data to be stored is drawn e.g. financial information, biological

information etc. Therefore the data to be stored in the database must be determined in cooperation with a person who does have expertise in that domain, and who is aware of what data must be stored within the system. This process is one which is generally considered part of requirements analysis, and requires skill on the part of the database designer to elicit the needed information from those with the domain knowledge. This is because those with the necessary domain knowledge frequently cannot express clearly what their system requirements for the database are as they are unaccustomed to thinking in terms of the discrete data elements which must be stored. Data to be stored can be determined by Requirement Specification.[4] Normalization In the field of relational database design, normalization is a systematic way of ensuring that a database structure is suitable for general-purpose querying and free of certain undesirable characteristicsinsertion, update, and deletion anomalies that could lead to a loss of data integrity. A standard piece of database design guidance is that the designer should create a fully normalized design; selective denormalization can subsequently be performed, but only for performance reasons. However, some modeling disciplines, such as the dimensional modeling approach to data warehouse design, explicitly recommend non-normalized designs, i.e. designs that in large part do not adhere to 3NF. Normalization consists of normal forms that are 1NF, 2NF, 3NF, BOYCECODD NF (3.5NF), 4NF and 5NF Types of Database design Conceptual schema Main article: Conceptual schema Once a database designer is aware of the data which is to be stored within the database, they must then determine where dependency is within the data. Sometimes when data is changed you can be changing other data that is not visible. For example, in a list of names and addresses, assuming a situation where multiple people can have the same address, but one person cannot have more than one address, the name is dependent upon the address, because if the address is different, then the associated name is different too. However, the other way around is different. One attribute can change and not another. (NOTE: A common misconception is that the relational model is so called because of the stating of relationships between data elements therein. This is not true. The relational model is so named because it is based upon the mathematical structures known as relations.) Logically structuring data Once the relationships and dependencies amongst the various pieces of information have been determined, it is possible to arrange the data into a logical structure which can then be mapped into the storage objects supported by the database management system. In the case of relational databases the storage objects are tables which store data in rows and columns. Each table may represent an implementation of either a logical object or a relationship joining one or more instances of one or more logical objects. Relationships between tables may then be stored as links connecting child tables with parents. Since complex logical relationships are themselves tables they will probably have links to more than one parent. In an Object database the storage objects correspond directly to the objects used by the Object-oriented programming language used to write the applications that will manage and access the data. The relationships may be defined as attributes of the object classes involved or as methods that operate on the object classes. Physical database design The physical design of the database specifies the physical configuration of the database on the storage media. This includes detailed specification of data elements, data types, indexing options and other parameters residing in the DBMS data dictionary. It is the detailed design of a system that includes modules & the database's hardware & software specifications of the system. Database normalization Database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency. Normalization usually involves dividing large tables into smaller (and less redundant) tables and defining relationships between them. The objective is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database via the defined relationships. Edgar F. Codd, the inventor of the relational model, introduced the concept of normalization and what we now know as the First Normal Form (1NF) in 1970. Codd went on to define the Second Normal Form (2NF) and Third Normal Form (3NF) in 1971, and Codd and Raymond F. Boyce defined the Boyce-Codd Normal Form (BCNF) in 1974.[3] Higher normal forms were defined by other theorists in subsequent years, the most recent being the Sixth Normal Form (6NF) introduced by Chris Date, Hugh Darwen, and Nikos Lorentzos in 2002.[4] Informally, a relational database table (the computerized representation of a relation) is often described as "normalized" if it is in the Third Normal Form. Most 3NF tables are free of insertion, update, and deletion anomalies, i.e. in most cases 3NF tables adhere to BCNF, 4NF, and 5NF (but typically not 6NF). A standard piece of database design guidance is that the designer should create a fully normalized design; selective denormalization can subsequently be performed for performance reasons. However, some modeling disciplines, such as the dimensional modeling approach to data warehouse design, explicitly recommend non-normalized designs, i.e. designs that in large part do not adhere to 3NF. Objectives of normalization A basic objective of the first normal form defined by Codd in 1970 was to permit data to be queried and manipulated using a "universal data sub-language" grounded in first-order logic.[8] (SQL is an example of such a data sub-language, albeit one that Codd regarded as seriously flawed.)[9]

The objectives of normalization beyond 1NF (First Normal Form) were stated as follows by Codd: 1. To free the collection of relations from undesirable insertion, update and deletion dependencies; 2. To reduce the need for restructuring the collection of relations, as new types of data are introduced, and thus increase the life span of application programs; 3. To make the relational model more informative to users; 4. To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by. E.F. Codd, "Further Normalization of the Data Base Relational Model"[10] The sections below give details of each of these objectives. Free the database of modification anomalies

An update anomaly. Employee 519 is shown as having different addresses on different records.

An insertion anomaly. Until the new faculty member, Dr. Newsome, is assigned to teach at least one course, his details cannot be recorded.

A deletion anomaly. All information about Dr. Giddens is lost when he temporarily ceases to be assigned to any courses. When an attempt is made to modify (update, insert into, or delete from) a table, undesired side-effects may follow. Not all tables can suffer from these side-effects; rather, the side-effects can only arise in tables that have not been sufficiently normalized. An insufficiently normalized table might have one or more of the following characteristics: The same information can be expressed on multiple rows; therefore updates to the table may result in logical inconsistencies. For example, each record in an "Employees' Skills" table might contain an Employee ID, Employee Address, and Skill; thus a change of address for a particular employee will potentially need to be applied to multiple records (one for each of his skills). If the update is not carried through successfullyif, that is, the employee's address is updated on some records but not othersthen the table is left in an inconsistent state. Specifically, the table provides conflicting answers to the question of what this particular employee's address is. This phenomenon is known as an update anomaly. There are circumstances in which certain facts cannot be recorded at all. For example, each record in a "Faculty and Their Courses" table might contain a Faculty ID, Faculty Name, Faculty Hire Date, and Course Codethus we can record the details of any faculty member who teaches at least one course, but we cannot record the details of a newly-hired faculty member who has not yet been assigned to teach any courses except by setting the Course Code to null. This phenomenon is known as an insertion anomaly. There are circumstances in which the deletion of data representing certain facts necessitates the deletion of data representing completely different facts. The "Faculty and Their Courses" table described in the previous example suffers from this type of anomaly, for if a faculty member temporarily ceases to be assigned to any courses, we must delete the last of the records on which that faculty member appears, effectively also deleting the faculty member. This phenomenon is known as a deletion anomaly. Minimize redesign when extending the database structure When a fully normalized database structure is extended to allow it to accommodate new types of data, the pre-existing aspects of the database structure can remain largely or entirely unchanged. As a result, applications interacting with the database are minimally affected. Make the data model more informative to users Normalized tables, and the relationship between one normalized table and another, mirror real-world concepts and their interrelationships. Avoid bias towards any particular pattern of querying

10

Normalized tables are suitable for general-purpose querying. This means any queries against these tables, including future queries whose details cannot be anticipated, are supported. In contrast, tables that are not normalized lend themselves to some types of queries, but not others. For example, consider an online bookseller whose customers maintain wishlists of books they'd like to have. For the obvious, anticipated querywhat books does this customer want?it's enough to store the customer's wishlist in the table as, say, a homogeneous string of authors and titles. With this design, though, the database can answer only that one single query. It cannot by itself answer interesting but unanticipated queries: What is the most-wished-for book? Which customers are interested in WWII espionage? How does Lord Byron stack up against his contemporary poets? Answers to these questions must come from special adaptive tools completely separate from the database. One tool might be software written especially to handle such queries. This special adaptive software has just one single purpose: in effect to normalize the non-normalized field. Unforeseen queries can be answered trivially and entirely within the database framework, with a normalized table. Example Querying and manipulating the data within an unnormalized data structure, such as the following non-1NF representation of customers' credit card transactions, involves more complexity than is really necessary: Customer Transactions Tr. ID Date Amount 12890 14-Oct-2003 87 Jones 12904 15-Oct-2003 50 Wilkins Tr. ID Date Amount 12898 14-Oct-2003 21 Tr. ID 12907 14920 15003 Date 15-Oct-2003 20-Nov-2003 27-Nov-2003 Amount 18 70

Stevens

60

To each customer there corresponds a repeating group of transactions. The automated evaluation of any query relating to customers' transactions therefore would broadly involve two stages: 1. Unpacking one or more customers' groups of transactions allowing the individual transactions in a group to be examined, and 2. Deriving a query result based on the results of the first stage For example, in order to find out the monetary sum of all transactions that occurred in October 2003 for all customers, the system would have to know that it must first unpack the Transactions group of each customer, then sum the Amounts of all transactions thus obtained where the Date of the transaction falls in October 2003. One of Codd's important insights was that this structural complexity could always be removed completely, leading to much greater power and flexibility in the way queries could be formulated (by users and applications) and evaluated (by the DBMS). The normalized equivalent of the structure above would look like this: Customer Tr. ID Date Amount Jones 12890 14-Oct-2003 87 Jones 12904 15-Oct-2003 50 Wilkins 12898 14-Oct-2003 21 Stevens 12907 15-Oct-2003 18 Stevens 14920 20-Nov-2003 70 Stevens 15003 27-Nov-2003 60 Now each row represents an individual credit card transaction, and the DBMS can obtain the answer of interest, simply by finding all rows with a Date falling in October, and summing their Amounts. The data structure places all of the values on an equal footing, exposing each to the DBMS directly, so each can potentially participate directly in queries; whereas in the previous situation some values were embedded in lower-level structures that had to be handled specially. Accordingly, the normalized design lends itself to general-purpose query processing, whereas the unnormalized design does not. Background to normalization: definitions
S. No 1. Terms Functional dependency Explanation In a given table, an attribute Y is said to have a functional dependency on a set of attributes X (written X Y) if and only if each X value is associated with precisely one Y value. For example, in an "Employee" table that includes the attributes "Employee ID" and "Employee Date of Birth", the functional dependency {Employee ID} {Employee Date of Birth} would hold. It follows from the previous two sentences that each {Employee ID} is associated with precisely one {Employee Date of Birth}. A trivial functional dependency is a functional dependency of an attribute on a superset of itself. {Employee ID, Employee Address} {Employee Address} is trivial, as is {Employee Address} {Employee Address}. An attribute is fully functionally dependent on a set of attributes X if it is: functionally dependent on X, and not functionally dependent on any proper subset of X. {Employee Address} has a functional dependency on {Employee ID, Skill}, but not a full functional dependency, because it is also dependent on {Employee ID}. A transitive dependency is an indirect functional dependency, one in which XZ only by virtue of XY and YZ.

2. 3.

Trivial functional dependency Full functional dependency

4.

Transitive dependency

11

5. 6. 7. 8.

Multivalued dependency Join dependency Superkey Candidate key

9. 10. 11.

Non-prime attribute Prime attribute Primary key

A multivalued dependency is a constraint according to which the presence of certain rows in a table implies the presence of certain other rows. A table T is subject to a join dependency if T can always be recreated by joining multiple tables each having a subset of the attributes of T. A superkey is a combination of attributes that can be used to uniquely identify a database record. A table might have many superkeys A candidate key is a special subset of superkeys that do not have any extraneous information in them: it is a minimal superkey Examples: Imagine a table with the fields <Name>, <Age>, <SSN> and <Phone Extension>. This table has many possible superkeys. Three of these are <SSN>, <Phone Extension, Name> and <SSN, Name>. Of those listed, only <SSN> is a candidate key, as the others contain information not necessary to uniquely identify records ('SSN' here refers to Social Security Number, which is unique to each person). A non-prime attribute is an attribute that does not occur in any candidate key. Employee Address would be a non-prime attribute in the "Employees' Skills" table. A prime attribute, conversely, is an attribute that does occur in some candidate key. Most DBMSs require a table to be defined as having a single unique key, rather than a number of possible unique keys. A primary key is a key which the database designer has designated for this purpose.

Normal forms The normal forms (abbrev. NF) of relational database theory provide criteria for determining a table's degree of vulnerability to logical inconsistencies and anomalies. The higher the normal form applicable to a table, the less vulnerable it is to inconsistencies and anomalies. Each table has a "highest normal form" (HNF): by definition, a table always meets the requirements of its HNF and of all normal forms lower than its HNF; also by definition, a table fails to meet the requirements of any normal form higher than its HNF. The normal forms are applicable to individual tables; to say that an entire database is in normal form n is to say that all of its tables are in normal form n. Newcomers to database design sometimes suppose that normalization proceeds in an iterative fashion, i.e. a 1NF design is first normalized to 2NF, then to 3NF, and so on. This is not an accurate description of how normalization typically works. A sensibly designed table is likely to be in 3NF on the first attempt; furthermore, if it is 3NF, it is overwhelmingly likely to have an HNF of 5NF. Achieving the "higher" normal forms (above 3NF) does not usually require an extra expenditure of effort on the part of the designer, because 3NF tables usually need no modification to meet the requirements of these higher normal forms. The main normal forms are summarized below.
Normal form Defined by Brief definition Two versions: E.F. Codd (1970), C.J. Table faithfully represents a relation and has no repeating 1st normal form (1NF) Date (2003)[11] groups No non-prime attribute in the table is functionally dependent 2nd normal form (2NF) E.F. Codd (1971)[2] on a proper subset of any candidate key E.F. Codd (1971);[2] see +also Carlo Every non-prime attribute is non-transitively dependent on 3rd normal form (3NF) Zaniolo's equivalent but differentlyevery candidate key in the table expressed definition (1982)[12] BoyceCodd normal form Raymond F. Boyce and E.F. Codd Every non-trivial functional dependency in the table is a (BCNF) (1974)[13] dependency on a superkey Every non-trivial multivalued dependency in the table is a 4th normal form (4NF) Ronald Fagin (1977)[14] dependency on a superkey Every non-trivial join dependency in the table is implied by 5th normal form (5NF) Ronald Fagin (1979)[15] the superkeys of the table Domain/key normal form Every constraint on the table is a logical consequence of the Ronald Fagin (1981)[16] (DKNF) table's domain constraints and key constraints C.J. Date, Hugh Darwen, and Nikos Table features no non-trivial join dependencies at all (with Sixth normal form (6NF) Lorentzos (2002)[4] reference to generalized join operator)

First normal form First normal form (1NF or Minimal Form) is a normal form used in database normalization. A relational database table that adheres to 1NF is one that meets a certain minimum set of criteria. These criteria are basically concerned with ensuring that the table is a faithful representation of a relation[1] and that it is free of repeating groups.[2] The concept of a "repeating group" is, however, understood in different ways by different theorists. As a consequence, there is no universal agreement as to which features would disqualify a table from being in 1NF. Most notably, 1NF as defined by some authors (for example, Ramez Elmasri and Shamkant B. Navathe,[3] following the precedent established by Edgar F. Codd) excludes relation-valued attributes (tables within tables); whereas 1NF as defined by other authors (for example, Chris Date) permits them. 1NF tables as representations of relations According to Date's definition of 1NF, a table is in 1NF if and only if it is "isomorphic to some relation", which means, specifically, that it satisfies the following five conditions: 1. There's no top-to-bottom ordering to the rows.

12

2. There's no left-to-right ordering to the columns. 3. There are no duplicate rows. 4. Every row-and-column intersection contains exactly one value from the applicable domain (and nothing else). 5. All columns are regular [i.e. rows have no hidden components such as row IDs, object IDs, or hidden timestamps]. Chris Date, "What First Normal Form Really Means", pp. 1278[4] Violation of any of these conditions would mean that the table is not strictly relational, and therefore that it is not in 1NF. Examples of tables (or views) that would not meet this definition of 1NF are: A table that lacks a unique key. Such a table would be able to accommodate duplicate rows, in violation of condition 3. A view whose definition mandates that results be returned in a particular order, so that the row-ordering is an intrinsic and meaningful aspect of the view.[5] This violates condition 1. The tuples in true relations are not ordered with respect to each other. A table with at least one nullable attribute. A nullable attribute would be in violation of condition 4, which requires every field to contain exactly one value from its column's domain. It should be noted, however, that this aspect of condition 4 is controversial. It marks an important departure from Codd's later vision of the relational model,[6] which made explicit provision for nulls.[7] Repeating groups Date's fourth condition, which expresses "what most people think of as the defining feature of 1NF",[8] is concerned with repeating groups. The following scenario illustrates how a database design might incorporate repeating groups, in violation of 1NF. Domains and values Suppose a designer wishes to record the names and telephone numbers of customers. He defines a customer table which looks like this: Customer Customer ID First Name Surname Telephone Number 123 Robert Ingram 555-861-2025 456 Jane Wright 555-403-1659 789 Maria Fernandez 555-808-9633 The designer then becomes aware of a requirement to record multiple telephone numbers for some customers. He reasons that the simplest way of doing this is to allow the "Telephone Number" field in any given record to contain more than one value: Customer Customer ID First Name Surname Telephone Number 123 Robert Ingram 555-861-2025 555-403-1659 456 Jane Wright 555-776-4100 789 Maria Fernandez 555-808-9633 Assuming, however, that the Telephone Number column is defined on some Telephone Number-like domain (e.g. the domain of strings 12 characters in length), the representation above is not in 1NF. 1NF (and, for that matter, the RDBMS) prevents a single field from containing more than one value from its column's domain. Repeating groups across columns The designer might attempt to get around this restriction by defining multiple Telephone Number columns: Customer Customer ID First Name Surname Tel. No. 1 Tel. No. 2 Tel. No. 3 123 Robert Ingram 555-861-2025 456 Jane Wright 555-403-1659 555-776-4100 555-403-1659 789 Maria Fernandez 555-808-9633 This representation, however, makes use of nullable columns, and therefore does not conform to Date's definition of 1NF (in violation to condition 4). Even if the view is taken that nullable columns are allowed, the design is not in keeping with the spirit of 1NF (in violation to condition 2). Tel. No. 1, Tel. No. 2., and Tel. No. 3. share exactly the same domain and exactly the same meaning; the splitting of Telephone Number into three headings is artificial and causes logical problems. These problems include: Difficulty in querying the table. Answering such questions as "Which customers have telephone number X?" and "Which pairs of customers share a telephone number?" is awkward. Inability to enforce uniqueness of Customer-to-Telephone Number links through the RDBMS. Customer 789 might mistakenly be given a Tel. No. 2 value that is exactly the same as her Tel. No. 1 value. Restriction of the number of telephone numbers per customer to three. If a customer with four telephone numbers comes along, we are constrained to record only three and leave the fourth unrecorded. This means that the database design is imposing constraints on the business process, rather than (as should ideally be the case) vice-versa. Repeating groups within columns The designer might, alternatively, retain the single Telephone sufficient length to accommodate multiple telephone numbers: Customer Customer ID First Name Surname 123 Robert Ingram 456 Jane Wright

Number column but alter its domain, making it a string of

Telephone Numbers 555-861-2025 555-403-1659, 555-776-4100

13

789

Maria

Fernandez 555-808-9633

This design is consistent with 1NF, but still presents several design issues. The Telephone Number heading becomes semantically non-specific, as it can now represent either a telephone number, a list of telephone numbers, or indeed anything at all. A query such as "Which pairs of customers share a telephone number?" is more difficult to formulate, given the necessity to cater for lists of telephone numbers as well as individual telephone numbers. Meaningful constraints on telephone numbers are also very difficult to define in the RDBMS with this design. A design that complies with 1NF A design that is unambiguously in 1NF makes use of two tables: a Customer Name table and a Customer Telephone Number table. Customer Name Customer ID First Name Surname 123 Robert Ingram 456 Jane Wright 789 Maria Fernandez Customer Telephone Number Customer ID Telephone Number 123 555-861-2025 456 555-403-1659 456 555-776-4100 789 555-808-9633 Repeating groups of telephone numbers do not occur in this design. Instead, each Customer-to-Telephone Number link appears on its own record. With Customer ID as key fields, a "parent-child" or one-to-many (1:M) relationship exists between the two tables, since a customer record (in the "parent" table, Customer Name) can have many telephone number records (in the "child" table, Customer Telephone Number), but each telephone number usually has one, and only one customer. In the case where several customers could share the same telephone number, an additional column is needed in the Customer Telephone Number table to represent a unique key. It is worth noting that this design meets the additional requirements for second and third normal form (3NF). Atomicity Some definitions of 1NF, most notably that of Edgar F. Codd, make reference to the concept of atomicity. Codd states that the "values in the domains on which each relation is defined are required to be atomic with respect to the DBMS." [9] Codd defines an atomic value as one that "cannot be decomposed into smaller pieces by the DBMS (excluding certain special functions)."[10] Meaning a field should not be divided into parts with more than one kind of data in it such that what one part means to the DBMS depends on another part of the same field. Hugh Darwen and Chris Date have suggested that Codd's concept of an "atomic value" is ambiguous, and that this ambiguity has led to widespread confusion about how 1NF should be understood.[11][12] In particular, the notion of a "value that cannot be decomposed" is problematic, as it would seem to imply that few, if any, data types are atomic: A character string would seem not to be atomic, as the RDBMS typically provides operators to decompose it into substrings. A fixed-point number would seem not to be atomic, as the RDBMS typically provides operators to decompose it into integer and fractional components. Date suggests that "the notion of atomicity has no absolute meaning":[13] a value may be considered atomic for some purposes, but may be considered an assemblage of more basic elements for other purposes. If this position is accepted, 1NF cannot be defined with reference to atomicity. Columns of any conceivable data type (from string types and numeric types to array types and table types) are then acceptable in a 1NF tablealthough perhaps not always desirable (For example, it would be more desirable to separate a Customer Name field into two separate fields as First Name, Surname). Date argues that relation-valued attributes, by means of which a field within a table can contain a table, are useful in rare cases.[14] Normalization beyond 1NF Any table that is in second normal form (2NF) or higher is, by definition, also in 1NF (each normal form has more stringent criteria than its predecessor). On the other hand, a table that is in 1NF may or may not be in 2NF; if it is in 2NF, it may or may not be in 3NF, and so on. Normal forms higher than 1NF are intended to deal with situations in which a table suffers from design problems that may compromise the integrity of the data within it. For example, the following table is in 1NF, but is not in 2NF and therefore is vulnerable to logical inconsistencies: Subscriber Email Addresses Subscriber ID Email Address 108 steve@aardvarkmail.net 252 carol@mongoosemail.org 252 crobertson@aardvarkmail.net 360 hclark@antelopemail.com The table's key is {Subscriber ID, Email Address}.

Subscriber First Name Steve Carol Carol Harriet

Subscriber Surname Wallace Robertson Robertson Clark

If Carol Robertson changes her surname by marriage, the change must be applied to two rows. If the change is only applied to one row, a contradiction results: the question "What is Customer 252's name?" has two conflicting answers. 2NF

14

addresses this problem. Note that Carol Robertson's record is appearing in the table twice because it has more than one email address related to it. A practical way to think of 1NF in the above table is to ask a series of questions about the relationships that records (rows) can have between entities (tables) or attributes (columns), based on given business rules or constraints. For example, could a Subscriber record relate to many Email Address records? Could an Email Address record relate to many Subscriber records? In the above table we can see that Carol Robertson has more than one email address. We could answer the questions by saying there's a one-to-many relationship (1:M) between Subscriber and Email Address in the above table, since a subscriber can have many email addresses, and an email address usually has one, and only one subscriber. We would then create a separate table called Subscribers and move the Subscriber First Name and Subscriber Surname columns from the Subscriber Email Addresses table into the new Subscribers table, adding a third column Subscriber ID as the primary key. Thus, a one-to-many relationship exists between the Subscribers table (with Subscriber ID as the primary key) and the Subscriber Email Addresses table (with Subscriber ID as the foreign key). The tables would conform to 2NF in addition to 1NF. Second normal form Second normal form (2NF) is a normal form used in database normalization. 2NF was originally defined by E.F. Codd in 1971.[1] A table that is in first normal form (1NF) must meet additional criteria if it is to qualify for second normal form. Specifically: a table is in 2NF if and only if, it is in 1NF and no non prime attribute is dependent on any proper subset of any candidate key of the table. A non prime attribute of a table is an attribute that is not a part of any candidate key of the table. Put simply, a table is in 2NF if and only if, it is in 1NF and every non-prime attribute of the table is either dependent on the whole of a candidate key, or on another non prime attribute. Note that when a 1NF table has no composite candidate keys (candidate keys consisting of more than one attribute), the table is automatically in 2NF. Example Consider a table describing employees' skills: Employees' Skills Employee Skill Jones Typing Jones Shorthand Jones Whittling Bravo Light Cleaning Ellis Alchemy Ellis Flying Harrison Light Cleaning

Current Work Location 114 Main Street 114 Main Street 114 Main Street 73 Industrial Way 73 Industrial Way 73 Industrial Way 73 Industrial Way

Neither {Employee} nor {Skill} is a candidate key for the table. This is because a given Employee might need to appear more than once (he might have multiple Skills), and a given Skill might need to appear more than once (it might be possessed by multiple Employees). Only the composite key {Employee, Skill} qualifies as a candidate key for the table. The remaining attribute, Current Work Location, is dependent on only part of the candidate key, namely Employee. Therefore the table is not in 2NF. Note the redundancy in the way Current Work Locations are represented: we are told three times that Jones works at 114 Main Street, and twice that Ellis works at 73 Industrial Way. This redundancy makes the table vulnerable to update anomalies: it is, for example, possible to update Jones' work location on his "Typing" and "Shorthand" records and not update his "Whittling" record. The resulting data would imply contradictory answers to the question "What is Jones' current work location?" A 2NF alternative to this design would represent the same information in two tables: an "Employees" table with candidate key {Employee}, and an "Employees' Skills" table with candidate key {Employee, Skill}: Employees Employee Current Work Location Jones 114 Main Street Bravo 73 Industrial Way Ellis 73 Industrial Way Harrison 73 Industrial Way Employees' Skills Employee Skill Jones Typing Jones Shorthand Jones Whittling Bravo Light Cleaning Ellis Alchemy Ellis Flying Harrison Light Cleaning Neither of these tables can suffer from update anomalies.

15

Not all 2NF tables are free from update anomalies, however. An example of a 2NF table which suffers from update anomalies is: Tournament Winners Tournament Year Des Moines Masters 1998 Indiana Invitational 1998 Cleveland Open 1999 Des Moines Masters 1999 Indiana Invitational 1999

Winner Chip Masterson Al Fredrickson Bob Albertson Al Fredrickson Chip Masterson

Winner Date of Birth 14 March 1977 21 July 1975 28 September 1968 21 July 1975 14 March 1977

Even though Winner and Winner Date of Birth are determined by the whole key {Tournament, Year} and not part of it, particular Winner / Winner Date of Birth combinations are shown redundantly on multiple records. This leads to an update anomaly: if updates are not carried out consistently, a particular winner could be shown as having two different dates of birth. The underlying problem is the transitive dependency to which the Winner Date of Birth attribute is subject. Winner Date of Birth actually depends on Winner, which in turn depends on the key Tournament / Year. This problem is addressed by third normal form (3NF). 2NF and candidate keys A table for which there are no partial functional dependencies on the primary key is typically, but not always, in 2NF. In addition to the primary key, the table may contain other candidate keys; it is necessary to establish that no non-prime attributes have part-key dependencies on any of these candidate keys. Multiple candidate keys occur in the following table: Electric Toothbrush Models Manufacturer Model Model Full Name Forte X-Prime Forte X-Prime Forte Ultraclean Forte Ultraclean Dent-o-Fresh EZbrush Dent-o-Fresh EZBrush Kobayashi ST-60 Kobayashi ST-60 Hoch Toothmaster Hoch Toothmaster Hoch X-Prime Hoch X-Prime

Manufacturer Country Italy Italy USA Japan Germany Germany

Even if the designer has specified the primary key as {Model Full Name}, the table is not in 2NF. {Manufacturer, Model} is also a candidate key, and Manufacturer Country is dependent on a proper subset of it: Manufacturer. To make the design conform to 2NF, it is necessary to have two tables: Electric Toothbrush Manufacturers Manufacturer Manufacturer Country Forte Italy Dent-o-Fresh USA Kobayashi Japan Hoch Germany Electric Toothbrush Models Manufacturer Model Forte X-Prime Forte Ultraclean Dent-o-Fresh EZbrush Kobayashi ST-60 Hoch Toothmaster Hoch X-Prime

Model Full Name Forte X-Prime Forte Ultraclean Dent-o-Fresh EZBrush Kobayashi ST-60 Hoch Toothmaster Hoch X-Prime

Third normal form In computer science, the third normal form (3NF) is a normal form used in database normalization. 3NF was originally defined by E.F. Codd in 1971.[1] Codd's definition states that a table is in 3NF if and only if both of the following conditions hold: The relation R (table) is in second normal form (2NF) Every non-prime attribute of R is non-transitively dependent (i.e. directly dependent) on every super key of R. A non-prime attribute of R is an attribute that does not belong to any candidate key of R.[2] A transitive dependency is a functional dependency in which X Z (X determines Z) indirectly, by virtue of X Y and Y Z (where it is not the case that Y X).[3] A 3NF definition that is equivalent to Codd's, but expressed differently, was given by Carlo Zaniolo in 1982. This definition states that a table is in 3NF if and only if, for each of its functional dependencies X A, at least one of the following conditions holds: X contains A (that is, X A is trivial functional dependency), or X is a superkey, or A-X, the set difference between A and X is a prime attribute (i.e., A-X is contained within a candidate key)

16

Zaniolo's definition gives a clear sense of the difference between 3NF and the more stringent BoyceCodd normal form (BCNF). BCNF simply eliminates the third alternative ("A is a prime attribute").

"Nothing but the key" A memorable statement of Codd's definition of 3NF, paralleling the traditional pledge to give true evidence in a court of law, was given by Bill Kent: "[Every] non-key [attribute] must provide a fact about the key, the whole key, and nothing but the key."[6] A common variation supplements this definition with the oath: "so help me Codd".[7] Requiring existence of "the key" ensures that the table is in 1NF; requiring that non-key attributes be dependent on "the whole key" ensures 2NF; further requiring that non-key attributes be dependent on "nothing but the key" ensures 3NF. Chris Date refers to Kent's summary as "an intuitively attractive characterization" of 3NF, and notes that with slight adaptation it may serve as a definition of the slightly-stronger BoyceCodd normal form: "Each attribute must represent a fact about the key, the whole key, and nothing but the key."[8] The 3NF version of the definition is weaker than Date's BCNF variation, as the former is concerned only with ensuring that non-key attributes are dependent on keys. Prime attributes (which are keys or parts of keys) must not be functionally dependent at all; they each represent a fact about the key in the sense of providing part or all of the key itself. (It should be noted here that this rule applies only to functionally dependent attributes, as applying it to all attributes would implicitly prohibit composite candidate keys, since each part of any such key would violate the "whole key" clause.) An example of a 2NF table that fails to meet the requirements of 3NF is: Tournament Winners Tournament Year Indiana Invitational 1998 Cleveland Open 1999 Des Moines Masters 1999 Indiana Invitational 1999

Winner Al Fredrickson Bob Albertson Al Fredrickson Chip Masterson

Winner Date of Birth 21 July 1975 28 September 1968 21 July 1975 14 March 1977

Because each row in the table needs to tell us who won a particular Tournament in a particular Year, the composite key {Tournament, Year} is a minimal set of attributes guaranteed to uniquely identify a row. That is, {Tournament, Year} is a candidate key for the table. The breach of 3NF occurs because the non-prime attribute Winner Date of Birth is transitively dependent on the candidate key {Tournament, Year} via the non-prime attribute Winner. The fact that Winner Date of Birth is functionally dependent on Winner makes the table vulnerable to logical inconsistencies, as there is nothing to stop the same person from being shown with different dates of birth on different records. In order to express the same facts without violating 3NF, it is necessary to split the table into two: Tournament Winners Tournament Year Indiana Invitational 1998 Cleveland Open 1999 Des Moines Masters 1999 Indiana Invitational 1999

Winner Al Fredrickson Bob Albertson Al Fredrickson Chip Masterson

Player Dates of Birth Player Date of Birth Chip Masterson 14 March 1977 Al Fredrickson 21 July 1975 Bob Albertson 28 September 1968 Update anomalies cannot occur in these tables, which are both in 3NF. Derivation of Zaniolo's conditions The definition of 3NF offered by Carlo Zaniolo in 1982, and given above, is proved in the following way: Let X A be a nontrivial FD (i.e. one where X does not contain A) and let A be a non-key attribute. Also let Y be a key of R. Then Y X. Therefore A is not transitively dependent on Y if and only if X Y, that is, if and only if X is a superkey.[9] Normalization beyond 3NF Most 3NF tables are free of update, insertion, and deletion anomalies. Certain types of 3NF tables, rarely met with in practice, are affected by such anomalies; these are tables which either fall short of BoyceCodd normal form (BCNF) or, if they meet BCNF, fall short of the higher normal forms 4NF or 5NF. BoyceCodd normal form BoyceCodd normal form (or BCNF or 3.5NF) is a normal form used in database normalization. It is a slightly stronger version of the third normal form (3NF). BCNF was developed in 1974 by Raymond F. Boyce and Edgar F. Codd to address certain types of anomaly not dealt with by 3NF as originally defined.[1] Chris Date has pointed out that a definition of what we now know as BCNF appeared in a paper by Ian Heath in 1971. [2] Date writes: "Since that definition predated Boyce and Codd's own definition by some three years, it seems to me that BCNF ought by rights to be called Heath normal form. But it isn't."[3]

17

If a relational scheme is in BCNF then all redundancy based on functional dependency has been removed, although other types of redundancy may still exist. A relational schema R is in BoyceCodd normal form if and only if for every one of its nontrivial dependencies X Y, at least one of the following conditions holds:[4]

X Y is a trivial functional dependency X is a superkey for schema R

3NF tables not meeting BCNF Only in rare cases does a 3NF table not meet the requirements of BCNF. A 3NF table which does not have multiple overlapping candidate keys is guaranteed to be in BCNF.[5] Depending on what its functional dependencies are, a 3NF table with two or more overlapping candidate keys may or may not be in BCNF An example of a 3NF table that does not meet BCNF is: Today's Court Bookings Court Start Time End Time Rate Type 1 09:30 10:30 SAVER 1 11:00 12:00 SAVER 1 14:00 15:30 STANDARD 2 10:00 11:30 PREMIUM-B 2 11:30 13:30 PREMIUM-B 2 15:00 16:30 PREMIUM-A Each row in the table represents a court booking at a tennis club that has one hard court (Court 1) and one grass court (Court 2) A booking is defined by its Court and the period for which the Court is reserved Additionally, each booking has a Rate Type associated with it. There are four distinct rate types: SAVER, for Court 1 bookings made by members STANDARD, for Court 1 bookings made by non-members PREMIUM-A, for Court 2 bookings made by members PREMIUM-B, for Court 2 bookings made by non-members

The table's superkeys are: S1 = {Court, Start Time} S2 = {Court, End Time} S3 = {Rate Type, Start Time} S4 = {Rate Type, End Time} S5 = {Court, Start Time, End Time} S6 = {Rate Type, Start Time, End Time} S7 = {Court, Rate Type, Start Time} S8 = {Court, Rate Type, End Time} ST = {Court, Rate Type, Start Time, End Time}, the trivial superkey Note that even though in the above table Start Time and End Time attributes have no duplicate values for each of them, we still have to admit that in some other days two different bookings on court 1 and court 2 could start at the same time or end at the same time. This is the reason why {Start Time} and {End Time} cannot be considered as the table's superkeys. However, only S1, S2, S3 and S4 are candidate keys (that is, minimal superkeys for that relation) because e.g. S 1 S5, so S5 cannot be a candidate key. Recall that 2NF prohibits partial functional dependencies of non-prime attributes (i.e. an attribute that does not occur in ANY candidate key) on candidate keys, and that 3NF prohibits transitive functional dependencies of non-prime attributes on candidate keys. In Today's Court Bookings table, there are no non-prime attributes: that is, all attributes belong to some candidate key. Therefore the table adheres to both 2NF and 3NF. The table does not adhere to BCNF. This is because of the dependency Rate Type Court, in which the determining attribute (Rate Type) is neither a candidate key nor a superset of a candidate key. Dependency Rate Type Court is respected as a Rate Type should only ever apply to a single Court. The design can be amended so that it meets BCNF: Rate Types Rate Type SAVER STANDARD PREMIUM-A PREMIUM-B

Court 1 1 2 2

Member Flag Yes No Yes No

Today's Bookings Rate Type Start Time SAVER 09:30 SAVER 11:00 STANDARD 14:00 PREMIUM-B 10:00 PREMIUM-B 11:30

End Time 10:30 12:00 15:30 11:30 13:30

18

PREMIUM-A 15:00

16:30

The candidate keys for the Rate Types table are {Rate Type} and {Court, Member Flag}; the candidate keys for the Today's Bookings table are {Rate Type, Start Time} and {Rate Type, End Time}. Both tables are in BCNF. Having one Rate Type associated with two different Courts is now impossible, so the anomaly affecting the original table has been eliminated. Achievability of BCNF In some cases, a non-BCNF table cannot be decomposed into tables that satisfy BCNF and preserve the dependencies that held in the original table. Beeri and Bernstein showed in 1979 that, for example, a set of functional dependencies {AB C, C B} cannot be represented by a BCNF schema. [6] Thus, unlike the first three normal forms, BCNF is not always achievable. Consider the following non-BCNF table whose functional dependencies follow the {AB C, C B} pattern: Nearest Shops Person Shop Type Davidson Optician Davidson Hairdresser Wright Bookshop Fuller Bakery Fuller Hairdresser Fuller Optician

Nearest Shop Eagle Eye Snippets Merlin Books Doughy's Sweeney Todd's Eagle Eye

For each Person / Shop Type combination, the table tells us which shop of this type is geographically nearest to the person's home. We assume for simplicity that a single shop cannot be of more than one type. The candidate keys of the table are: {Person, Shop Type} {Person, Nearest Shop} Because all three attributes are prime attributes (i.e. belong to candidate keys), the table is in 3NF. The table is not in BCNF, however, as the Shop Type attribute is functionally dependent on a non-superkey: Nearest Shop. The violation of BCNF means that the table is subject to anomalies. For example, Eagle Eye might have its Shop Type changed to "Optometrist" on its "Fuller" record while retaining the Shop Type "Optician" on its "Davidson" record. This would imply contradictory answers to the question: "What is Eagle Eye's Shop Type?" Holding each shop's Shop Type only once would seem preferable, as doing so would prevent such anomalies from occurring: Shop Near Person Person Shop Davidson Eagle Eye Davidson Snippets Wright Merlin Books Fuller Doughy's Fuller Sweeney Todd's Fuller Eagle Eye Shop Shop Eagle Eye Snippets Merlin Books Doughy's Sweeney Todd's

Shop Type Optician Hairdresser Bookshop Bakery Hairdresser

In this revised design, the "Shop Near Person" table has a candidate key of {Person, Shop}, and the "Shop" table has a candidate key of {Shop}. Unfortunately, although this design adheres to BCNF, it is unacceptable on different grounds: it allows us to record multiple shops of the same type against the same person. In other words, its candidate keys do not guarantee that the functional dependency {Person, Shop Type} {Shop} will be respected. A design that eliminates all of these anomalies (but does not conform to BCNF) is possible.[7] This design consists of the original "Nearest Shops" table supplemented by the "Shop" table described above. Nearest Shops Person Shop Type Davidson Optician Davidson Hairdresser Wright Bookshop Fuller Bakery Fuller Hairdresser Fuller Optician Shop Shop

Nearest Shop Eagle Eye Snippets Merlin Books Doughy's Sweeney Todd's Eagle Eye

Shop Type

19

Eagle Eye Snippets Merlin Books Doughy's Sweeney Todd's

Optician Hairdresser Bookshop Bakery Hairdresser

If a referential integrity constraint is defined to the effect that {Shop Type, Nearest Shop} from the first table must refer to a {Shop Type, Shop} from the second table, then the data anomalies described previously are prevented. Fourth normal form Fourth normal form (4NF) is a normal form used in database normalization. Introduced by Ronald Fagin in 1977, 4NF is the next level of normalization after BoyceCodd normal form (BCNF). Whereas the second, third, and BoyceCodd normal forms are concerned with functional dependencies, 4NF is concerned with a more general type of dependency known as a multivalued dependency. A Table is in 4NF if and only if, for every one of its non-trivial multivalued dependencies X Y, X is a superkeythat is, X is either a candidate key or a superset thereof.[1] Multivalued dependencies If the column headings in a relational database table are divided into three disjoint groupings X, Y, and Z, then, in the context of a particular row, we can refer to the data beneath each group of headings as x, y, and z respectively. A multivalued dependency X Y signifies that if we choose any x actually occurring in the table (call this choice xc), and compile a list of all the xcyz combinations that occur in the table, we will find that xc is associated with the same y entries regardless of z. A trivial multivalued dependency X Y is one where either Y is a subset of X, or X and Y together form the whole set of attributes of the relation. A functional dependency is a special case of multivalued dependency. In a functional dependency X Y, every x determines exactly one y, never more than one. Example Consider the following example: Pizza Delivery Permutations Restaurant Pizza Variety Delivery Area A1 Pizza Thick Crust Springfield A1 Pizza Thick Crust Shelbyville A1 Pizza Thick Crust Capital City A1 Pizza Stuffed Crust Springfield A1 Pizza Stuffed Crust Shelbyville A1 Pizza Stuffed Crust Capital City Elite Pizza Thin Crust Capital City Elite Pizza Stuffed Crust Capital City Vincenzo's Pizza Thick Crust Springfield Vincenzo's Pizza Thick Crust Shelbyville Vincenzo's Pizza Thin Crust Springfield Vincenzo's Pizza Thin Crust Shelbyville Each row indicates that a given restaurant can deliver a given variety of pizza to a given area. The table has no non-key attributes because its only key is {Restaurant, Pizza Variety, Delivery Area}. Therefore it meets all normal forms up to BCNF. If we assume, however, that pizza varieties offered by a restaurant are not affected by delivery area, then it does not meet 4NF. The problem is that the table features two non-trivial multivalued dependencies on the {Restaurant} attribute (which is not a superkey). The dependencies are: {Restaurant} {Pizza Variety} {Restaurant} {Delivery Area} These non-trivial multivalued dependencies on a non-superkey reflect the fact that the varieties of pizza a restaurant offers are independent from the areas to which the restaurant delivers. This state of affairs leads to redundancy in the table: for example, we are told three times that A1 Pizza offers Stuffed Crust, and if A1 Pizza starts producing Cheese Crust pizzas then we will need to add multiple rows, one for each of A1 Pizza's delivery areas. There is, moreover, nothing to prevent us from doing this incorrectly: we might add Cheese Crust rows for all but one of A1 Pizza's delivery areas, thereby failing to respect the multivalued dependency {Restaurant} {Pizza Variety}. To eliminate the possibility of these anomalies, we must place the facts about varieties offered into a different table from the facts about delivery areas, yielding two tables that are both in 4NF: Varieties By Restaurant Restaurant Pizza Variety A1 Pizza Thick Crust A1 Pizza Stuffed Crust Elite Pizza Thin Crust Elite Pizza Stuffed Crust Vincenzo's Pizza Thick Crust Vincenzo's Pizza Thin Crust

20

Delivery Areas By Restaurant Restaurant Delivery Area A1 Pizza Springfield A1 Pizza Shelbyville A1 Pizza Capital City Elite Pizza Capital City Vincenzo's Pizza Springfield Vincenzo's Pizza Shelbyville In contrast, if the pizza varieties offered by a restaurant sometimes did legitimately vary from one delivery area to another, the original three-column table would satisfy 4NF. Ronald Fagin demonstrated that it is always possible to achieve 4NF.[2] Rissanen's theorem is also applicable on multivalued dependencies. 4NF in practice A 1992 paper by Margaret S. Wu notes that the teaching of database normalization typically stops short of 4NF, perhaps because of a belief that tables violating 4NF (but meeting all lower normal forms) are rarely encountered in business applications. This belief may not be accurate, however. Wu reports that in a study of forty organizational databases, over 20% contained one or more tables that violated 4NF while meeting all lower normal forms.[3] Fifth normal form Fifth normal form (5NF), also known as Project-join normal form (PJ/NF) is a level of database normalization designed to reduce redundancy in relational databases recording multi-valued facts by isolating semantically related multiple relationships. A table is said to be in the 5NF if and only if every join dependency in it is implied by the candidate keys. A join dependency *{A, B, Z} on R is implied by the candidate key(s) of R if and only if each of A, B, , Z is a superkey for R.[1] Example Travelling Salesman Product Availability By Brand Travelling Salesman Brand Product Type Jack Schneider Acme Vacuum Cleaner Jack Schneider Acme Breadbox Willy Loman Robusto Pruning Shears Willy Loman Robusto Vacuum Cleaner Willy Loman Robusto Breadbox Willy Loman Robusto Umbrella Stand Louis Ferguson Robusto Vacuum Cleaner Louis Ferguson Robusto Telescope Louis Ferguson Acme Vacuum Cleaner Louis Ferguson Acme Lava Lamp Louis Ferguson Nimbus Tie Rack The table's predicate is: Products of the type designated by Product Type, made by the brand designated by Brand, are available from the travelling salesman designated by Travelling Salesman. In the absence of any rules restricting the valid possible combinations of Travelling Salesman, Brand, and Product Type, the three-attribute table above is necessary in order to model the situation correctly. Suppose, however, that the following rule applies: A Travelling Salesman has certain Brands and certain Product Types in his repertoire. If Brand B is in his repertoire, and Product Type P is in his repertoire, then (assuming Brand B makes Product Type P), the Travelling Salesman must offer products of Product Type P made by Brand B.In that case, it is possible to split the table into three: Product Types By Travelling Salesman Travelling Salesman Product Type Jack Schneider Vacuum Cleaner Jack Schneider Breadbox Willy Loman Pruning Shears Willy Loman Vacuum Cleaner Willy Loman Breadbox Willy Loman Umbrella Stand Louis Ferguson Telescope Louis Ferguson Vacuum Cleaner Louis Ferguson Lava Lamp Louis Ferguson Tie Rack Brands By Travelling Salesman Travelling Salesman Brand Jack Schneider Acme Willy Loman Robusto Louis Ferguson Robusto Louis Ferguson Acme

21

Louis Ferguson

Nimbus

Product Types By Brand Brand Product Type Acme Vacuum Cleaner Acme Breadbox Acme Lava Lamp Robusto Pruning Shears Robusto Vacuum Cleaner Robusto Breadbox Robusto Umbrella Stand Robusto Telescope Nimbus Tie Rack Note how this setup helps to remove redundancy. Suppose that Jack Schneider starts selling Robusto's products. In the previous setup we would have to add two new entries since Jack Schneider is able to sell two Product Types covered by Robusto: Breadboxes and Vacuum Cleaners. With the new setup we need only add a single entry (in Brands By Travelling Salesman). Usage Only in rare situations does a 4NF table not conform to 5NF. These are situations in which a complex real-world constraint governing the valid combinations of attribute values in the 4NF table is not implicit in the structure of that table. If such a table is not normalized to 5NF, the burden of maintaining the logical consistency of the data within the table must be carried partly by the application responsible for insertions, deletions, and updates to it; and there is a heightened risk that the data within the table will become inconsistent. In contrast, the 5NF design excludes the possibility of such inconsistencies. Sixth normal form Sixth normal form (6NF) is a term in relational database theory, used in two different ways. 6NF (C. Date's definition) A book by Christopher J. Date and others on temporal databases,[1] defined sixth normal form as a normal form for databases based on an extension of the relational algebra. In this work, the relational operators, such as join, are generalized to support a natural treatment of interval data, such as sequences of dates or moments in time.[2] Sixth normal form is then based on this generalized join, as follows: A relvar R [table] is in sixth normal form (abbreviated 6NF) if and only if it satisfies no nontrivial join dependencies at all where, as before, a join dependency is trivial if and only if at least one of the projections (possibly U_projections) involved is taken over the set of all attributes of the relvar [table] concerned.[Date et al.][3] Any relation in 6NF is also in 5NF. Sixth normal form is intended to decompose relation variables to irreducible components. Though this may be relatively unimportant for non-temporal relation variables, it can be important when dealing with temporal variables or other interval data. For instance, if a relation comprises a supplier's name, status, and city, we may also want to add temporal data, such as the time during which these values are, or were, valid (e.g., for historical data) but the three values may vary independently of each other and at different rates. We may, for instance, wish to trace the history of changes to Status. For further discussion on Temporal Aggregation in SQL, see also Zimanyi.[4] For a different approach, DKNF Some authors use the term sixth normal form differently, namely, as a synonym for Domain/key normal form (DKNF). This usage predates Date et al.'s work.[6] Usage The sixth normal form is currently being used in some data warehouses where the benefits outweigh the drawbacks,[7] for example using Anchor Modeling. Although using 6NF leads to an explosion of tables, modern databases can prune the tables from select queries (using a process called 'table elimination') where they are not required and thus speed up queries that only access several attributes. Denormalization Databases intended for online transaction processing (OLTP) are typically more normalized than databases intended for online analytical processing (OLAP). OLTP applications are characterized by a high volume of small transactions such as updating a sales record at a supermarket checkout counter. The expectation is that each transaction will leave the database in a consistent state. By contrast, databases intended for OLAP operations are primarily "read mostly" databases. OLAP applications tend to extract historical data that has accumulated over a long period of time. For such databases, redundant or "denormalized" data may facilitate business intelligence applications. Specifically, dimensional tables in a star schema often contain denormalized data. The denormalized or redundant data must be carefully controlled during extract, transform, load (ETL) processing, and users should not be permitted to see the data until it is in a consistent state. The normalized alternative to the star schema is the snowflake schema. In many cases, the need for denormalization has waned as computers and RDBMS software have become more powerful, but since data volumes have generally increased along with hardware and software performance, OLAP databases often still use denormalized schemas. Denormalization is also used to improve performance on smaller computers as in computerized cash-registers and mobile devices, since these may use the data for look-up only (e.g. price lookups). Denormalization may also be used when no RDBMS exists for a platform (such as Palm), or no changes are to be made to the data and a swift response is crucial.

22

Primitive data type In computer science, primitive data type is either of the following:[citation needed] a basic type is a data type provided by a programming language as a basic building block. Most languages allow more complicated composite types to be recursively constructed starting from basic types. a built-in type is a data type for which the programming language provides built-in support. In most programming languages, all basic data types are built-in. In addition, many languages also provide a set of composite data types. Opinions vary as to whether a built-in type that is not basic should be considered "primitive". Depending on the language and its implementation, primitive data types may or may not have a one-to-one correspondence with objects in the computer's memory. However, one usually expects operations on basic primitive data types to be the fastest language constructs there are.[citation needed] Integer addition, for example, can be performed as a single machine instruction, and some processors offer specific instructions to process sequences of characters with a single instruction. In particular, the C standard mentions that "a 'plain' int object has the natural size suggested by the architecture of the execution environment". This means that int is likely to be 32 bits long on a 32-bit architecture. Basic primitive types are almost always Value types Most languages do not allow the behavior or capabilities of primitive (either built-in or basic) data types to be modified by programs. Exceptions include Smalltalk, which permits all data types to be extended within a program, adding to the operations that can be performed on them or even redefining the built-in operations. Overview The actual range of primitive data types that is available is dependent upon the specific programming language that is being used. For example, in C, strings are a composite but built-in data type, whereas in modern dialects of BASIC and in JavaScript, they are assimilated to a primitive data type that is both basic and built-in. Classic basic primitive types may include:

Character (character, char); Integer (integer, int, short, long, byte) with a variety of precisions; Floating-point number (float, double, real, double precision); Fixed-point number (fixed) with a variety of precisions and a programmer-selected scale. Boolean, logical values true and false. Reference (also called a pointer or handle), a small value referring to another object's address in memory, possibly a much larger one.

More sophisticated types which can be built-in include: Tuples in ML, Python Linked lists in Lisp Complex numbers in Fortran, C (C99), Lisp, Python, Perl 6, D Rational numbers in Lisp, Perl 6 Hash tables in various guises, in Lisp, Perl, Python, Lua, D First-class functions, closures, continuations in languages that support functional programming such as Lisp, ML, Perl 6, D and C# 3.0 Specific primitive data types Integer numbers An integer data type can hold a whole number, but no fraction. Integers may be either signed (allowing negative values) or unsigned (nonnegative values only). Typical sizes of integers are: Size Names Signed Range Unsigned Range 8 bits Byte 128 to +127 0 to 255 16 bits Word, short int 32,768 to +32,767 0 to 65,535 Double Word, long int (win32, 32 bits 2,147,483,648 to +2,147,483,647 0 to 4,294,967,295 win64, 32-bit Linux[1]) [1] long int (C in 64-bit linux ), long 9,223,372,036,854,775,808 to 64 bits long (C), long (Java, the signed 0 to 18,446,744,073,709,551,615 +9,223,372,036,854,775,807 [2] integer variant only )) unlimited Bignum Literals for integers consist of a sequence of digits. Most programming languages disallow use of commas for digit grouping, although Fortran (77, 90, and above, fixed form source but not free form source) allows embedded spaces, and Perl, Ruby, and D allow embedded underscores. Negation is indicated by a minus sign () before the value. Examples of integer literals are: 42 10000 233000 Booleans A boolean type, typically denoted "bool" or "boolean", is typically a logical type that can be either "true" or "false". Although only one bit is necessary to accommodate the value set "true" and "false", programming languages typically implement boolean types as one or more bytes.

23

Most languages (Java, Pascal and Ada, e.g.) implement booleans adhering to the concept of boolean as a distinct logical type. Languages, though, may implicitly convert booleans to numeric types at times to give extended semantics to booleans and boolean expressions or to achieve backwards compatibility with earlier versions of the language. In C++, e,g., boolean values may be implicitly converted to integers, according to the mapping false 0 and true 1 (for example, true + true would be a valid expression evaluating to 2). The boolean type bool in C++ is considered an integral type and is a cross between numeric type and a logical type. Floating-point numbers A floating-point number represents a limited-precision rational number that may have a fractional part. These numbers are stored internally in a format equivalent to scientific notation, typically in binary but sometimes in decimal. Because floating-point numbers have limited precision, only a subset of real or rational numbers are exactly representable; other numbers can be represented only approximately. Many languages have both a single precision (often called "float") and a double precision type. Literals for floating point numbers include a decimal point, and typically use "e" or "E" to denote scientific notation. Examples of floating-point literals are: 20.0005 99.9 5000.12 6.02e23 Some languages (e.g., Fortran, Python, D) also have a complex number type comprising two floating-point numbers: a real part and an imaginary part. Fixed-point numbers A fixed-point number represents a limited-precision rational number that may have a fractional part. These numbers are stored internally in a scaled-integer form, typically in binary but sometimes in decimal. Because fixed-point numbers have limited precision, only a subset of real or rational numbers are exactly representable; other numbers can be represented only approximately. Fixed-point numbers also tend to have a more limited range of values than floating point, and so the programmer must be careful to avoid overflow in intermediate calculations as well as the final results. Characters and strings A character type (typically called "char") may contain a single letter, digit, punctuation mark, symbol, formatting code, control code, or some other specialized code (e.g., a byte order mark). Some languages have two or more character types, for example a single-byte type for ASCII characters and a multi-byte type for Unicode characters. The term "character type" is normally used even for types whose values more precisely represent code units, for example a UTF-16 code unit as in Java and JavaScript. Characters may be combined into strings. The string data can include numbers and other numerical symbols but will be treated as text. In most languages, a string is equivalent to an array of characters or code units, but Java treats them as distinct types (java.lang.String and char[]). Other languages (such as Python, and many dialects of BASIC) have no separate character type; strings with a length of one are normally used to represent (single code unit) characters. Literals for characters and strings are usually surrounded by quotation marks: sometimes, single quotes (') are used for characters and double quotes (") are used for strings. Examples of character literals in C syntax are: 'A' '4' '$' '\t' (tab character) Examples of string literals in C syntax are: "A" "Hello World" Numeric data type ranges Each numeric data type has its maximum and minimum value known as the range. Attempting to store a number outside the range may lead to compiler/runtime errors, or to incorrect calculations (due to truncation) depending on the language being used. The range of a variable is based on the number of bytes used to save the value, and an integer data type is usually able to store values (where is the number of bits that contribute to the value). For other data types (e.g. floating point values) the range is more complicated and will vary depending on the method used to store it. There are also some types that do not use entire bytes, e.g. a boolean that requires a single bit, and represents a binary value (although in practice a byte is often used, with the remaining 7 bits being redundant). Some programming languages (such as Ada and Pascal) also allow the opposite direction, that is, the programmer defines the range and precision needed to solve a given problem and the compiler chooses the most appropriate integer or floating point type automatically. Composite data type In computer science, a composite data type is any data type which can be constructed in a program using its programming language's primitive data types and other composite types. The act of constructing a composite type is known as composition. C/C++ structures and classes For more details on this topic, see C++ classes.

24

A struct is C's and C++'s notion of a composite type, a datatype that composes a fixed set of labeled fields or members. It is so called because of the struct keyword used in declaring them, which is short for structure or, more precisely, user-defined data structure. In C++, the only difference between a struct and a class is the default access level, which is private for classes and public for structs. Note that while classes and the class keyword were completely new in C++, the C programming language already had a crude type of structs. For all intents and purposes, C++ structs form a superset of C structs: virtually all valid C structs are valid C++ structs with the same semantics. Declaration A struct declaration consists of a list of fields, each of which can have any type. The total storage required for a struct object is the sum of the storage requirements of all the fields, plus any internal padding. For example: struct Account { int account_number; char *first_name; char *last_name; float balance; }; defines a type, referred to as struct Account. To create a new variable of this type, we can write struct Account myAccount; which has an integer component, accessed by myAccount.account_number, and a floating-point component, accessed by myAccount.balance, as well as the first_name and last_name components. The structure myAccount contains all four values, and all four fields may be changed independently. Since writing struct Account repeatedly in code becomes cumbersome, it is not unusual to see a typedef statement in C code to provide a more convenient synonym for the struct. For example: typedef struct Account_ { int account_number; char *first_name; char *last_name; float balance; } Account; In C++ code, the typedef is not needed because types defined using struct are already part of the regular namespace, so the type can be referred to as either struct Account or simply Account. As another example, a three-dimensional Vector composite type that uses the floating point data type could be created with: struct Vector { float x; float y; float z; }; A variable named velocity with a Vector composite type would be declared as Vector velocity; Members of the velocity would be accessed using a dot notation. For example, velocity.x = 5; would set the x component of velocity equal to 5. Likewise, a color structure could be created using: struct Color { unsigned int red; unsigned int green; unsigned int blue; }; In 3D graphics, you usually must keep track of both the position and color of each vertex. One way to do this would be to create a Vertex composite type, using the previously created Vector and Color composite types: struct Vertex { Vector position; Color color; }; Instantiation Create a variable of type Vertex using the same format as before: Vertex v; Member access Assign values to the components of v like so: v.position.x = 0.0; v.position.y = 1.5; v.position.z = 0.0; v.color.red = 128; v.color.green = 0; v.color.blue = 255; Primitive subtype The primary use of struct is for the construction of complex datatypes, but sometimes it is used to create primitive structural subtyping. For example, since Standard C requires that if two structs have the same initial fields, those fields will be represented in the same way, the code struct ifoo_old_stub { long x, y; };

25

struct ifoo_version_42 { long x, y, z; char *name; long a, b, c; }; void operate_on_ifoo(struct ifoo_old_stub *); struct ifoo_version_42 s; ... operate_on_ifoo(&s); will work correctly. Function types Function types (or type signatures) are constructed from primitive and composite types, and can serve as types themselves when constructing composite types: typedef struct { int x; int y; } Point; typedef double (*Metric) (Point p1, Point p2); typedef struct { Point centre; double radius; Metric metric; } Circle; Physical and logical database Logical & Physical Data Model A logical data model (LDM) in systems engineering is a representation of an organization's data, organized in terms of entities and relationships and is independent of any particular data management technology. Overview Logical data models represent the abstract structure of some domain of information. They are often diagrammatic in nature and are most typically used in business processes that seek to capture things of importance to an organization and how they relate to one another. Once validated and approved, the logical data model can become the basis of a physical data model and inform the design of a database. Logical data models should be based on the structures identified in a preceding conceptual data model, since this describes the semantics of the information context, which the logical model should also reflect. Even so, since the logical data model anticipates implementation on a specific computing system, the content of the logical data model is adjusted to achieve certain efficiencies. The term 'Logical Data Model' is sometimes used as a synonym of 'Domain Model' or as an alternative to the domain model. While the two concepts are closely related, and have overlapping goals, a domain model is more focused on capturing the concepts in the problem domain rather than the structure of the data associated with that domain. A logical data model is sometimes incorrectly called a physical data model, which is not what the ANSI people had in mind. The physical design of a database involves deep use of particular database management technology. For example, a table/column design could be implemented on a collection of computers, located in different parts of the world. That is the domain of the physical model. People often get confused with the fact that logical and physical data models are very different in their objectives, goals and content. Following are some key differences. Logical Data Model Physical Data Model
Includes entities (tables), attributes (columns/fields) and relationships (keys) Includes tables, columns, keys, data types, validation rules, database triggers, stored procedures, domains, and access constraints

Uses more defined and less generic specific names for tables and columns, such as Uses business names for entities abbreviated column names, limited by the database management system (DBMS) and any & attributes company defined standards Is independent of technology (platform, DBMS) Is normalized to fourth normal form(4NF) Includes primary keys and indices for fast data access. May be de-normalized to meet performance requirements based on the nature of the database. If the nature of the database is Online Transaction Processing(OLTP) or Operational Data Store (ODS) it is usually not de-normalized. De-normalization is common in Datawarehouses.

Physical database design The physical design of the database specifies the physical configuration of the database on the storage media. This includes detailed specification of data elements, data types, indexing options and other parameters residing in the DBMS data dictionary. It is the detailed design of a system that includes modules & the database's hardware & software specifications of the system. Data Abstraction

26

Abstraction is used to manage complexity. Software developers use abstraction to decompose complex systems into smaller components. As development progress, programmers know the functionality they can expect from as yet undeveloped subsystems. Thus, programmers are not burdened by considering the ways in which the implementation of later subsystems will affect the design of earlier development. The best definition of abstraction Ive ever read is: An abstraction denotes the essential characteristics of an object that distinguish it from all other kinds of object and thus provide crisply defined conceptual boundaries, relative to the perspective of the viewer. 1. The major purpose of a database system is to provide users with an abstract view of the system. The system hides certain details of how data is stored and created and maintained Complexity should be hidden from database users. 2. There are several levels of abstraction: 1. Physical Level: How the data are stored. E.g. index, B-tree, hashing. Lowest level of abstraction. Complex low-level structures described in detail. 2. Conceptual Level: Next highest level of abstraction. Describes what data are stored. Describes the relationships among data. Database administrator level. 3. View Level: Highest level. Describes part of the database for a particular group of users. Can be many different views of a database. E.g. tellers in a bank get a view of customer accounts, but not of payroll data. Fig. 1.1 (figure 1.1 in the text) illustrates the three levels.

Figure 1.1: The three levels of data abstraction Data independence Data independence is the type of data transparency that matters for a centralized DBMS. It refers to the immunity of user applications to make changes in the definition and organization of data. Physical data independence deals with hiding the details of the storage structure from user applications. The application should not be involved with these issues, since there is no difference in the operation carried out against the data. The data independence and operation independence together gives the feature of data abstraction. There are two levels of data independence. First level The logical structure of the data is known as the schema definition. In general, if a user application operates on a subset of the attributes of a relation, it should not be affected later when new attributes are added to the same relation. Logical data independence indicates that the conceptual schema can be changed without affecting the existing schemas. Second level The physical structure of the data is referred to as "physical data description". Physical data independence deals with hiding the details of the storage structure from user applications. The application should not be involved with these issues since, conceptually; there is no difference in the operations carried out against the data. There are two types of data independence: 1. Logical data independence: The ability to change the logical (conceptual) schema without changing the External schema (User View) is called logical data independence. For example, the addition or removal of new entities, attributes, or relationships to the conceptual schema should be possible without having to change existing external schemas or having to rewrite existing application programs. 2. Physical data independence: The ability to change the physical schema without changing the logical schema is called physical data independence. For example, a change to the internal schema, such as using different file organization or storage structures, storage devices, or indexing strategy, should be possible without having to change the conceptual or external schemas. 3. View level data independence: always independent no effect, because there doesn't exist any other level above view level. Data Independence Data independence can be explained as follows: Each higher level of the data architecture is immune to changes of the next lower level of the architecture. Data independence has two types:

27

1. Physical Independence: The logical scheme stays unchanged even though the storage space or type of some data is
changed for reasons of optimization or reorganization. In this external schema does not change. In these internal schemas changes may be required due to some physical schema were reorganized here. Physical data independence is present in most databases and file environment in which hardware storage of encoding, exact location of data on disk, merging of records, so on this are hidden from user. 2. Logical Independence: The external scheme may stay unchanged for most changes of the logical scheme. This is especially desirable as the application software does not need to be modified or newly translated. Data aggregation Print A AA AAA LinkedIn Facebook Twitter Share This Reprints Data aggregation is any process in which information is gathered and expressed in a summary form, for purposes such as statistical analysis. A common aggregation purpose is to get more information about particular groups based on specific variables such as age, profession, or income. The information about such groups can then be used for Web site personalization to choose content and advertising likely to appeal to an individual belonging to one or more groups for which data has been collected. For example, a site that sells music CDs might advertise certain CDs based on the age of the user and the data aggregate for their age group. Online analytic processing (OLAP) is a simple type of data aggregation in which the marketer uses an online reporting mechanism to process the information. Data aggregation can be user-based: personal data aggregation services offer the user a single point for collection of their personal information from other Web sites. The customer uses a single master personal identification number (PIN) to give them access to their various accounts (such as those for financial institutions, airlines, book and music clubs, and so on). Performing this type of data aggregation is sometimes referred to as "screen scraping. Relational algebra Lecture Slides available: PDF PowerPoint Relational Algebra Contents Terminology Operators - Write Operators - Retrieval Relational SELECT Relational PROJECT SELECT and PROJECT Set Operations - semantics SET Operations - requirements UNION Example INTERSECTION Example DIFFERENCE Example CARTESIAN PRODUCT CARTESIAN PRODUCT example JOIN Operator JOIN Example Natural Join OUTER JOINs OUTER JOIN example 1 OUTER JOIN example 2 In order to implement a DBMS, there must exist a set of rules which state how the database system will behave. For instance, somewhere in the DBMS must be a set of statements which indicate than when someone inserts data into a row of a relation, it has the effect which the user expects. One way to specify this is to use words to write an `essay' as to how the DBMS will operate, but words tend to be imprecise and open to interpretation. Instead, relational databases are more usually defined using Relational Algebra. Relational Algebra is : the formal description of how a relational database operates an interface to the data stored in the database itself the mathematics which underpin SQL operations Operators in relational algebra are not necessarily the same as SQL operators, even if they have the same name. For example, the SELECT statement exists in SQL, and also exists in relational algebra. These two uses of SELECT are not the same. The DBMS must take whatever SQL statements the user types in and translate them into relational algebra operations before applying them to the database. Terminology Relation - a set of tuples.

28

Tuple - a collection of attributes which describe some real world entity. Attribute - a real world role played by a named domain. Domain - a set of atomic values. Set - a mathematical definition for a collection of objects which contains no duplicates.

Operators - Write INSERT - provides a list of attribute values for a new tuple in a relation. This operator is the same as SQL. DELETE - provides a condition on the attributes of a relation to determine which tuple(s) to remove from the relation. This operator is the same as SQL. MODIFY - changes the values of one or more attributes in one or more tuples of a relation, as identified by a condition operating on the attributes of the relation. This is equivalent to SQL UPDATE. Operators - Retrieval There are two groups of operations: Mathematical set theory based relations: UNION, INTERSECTION, DIFFERENCE, and CARTESIAN PRODUCT. Special database operations: SELECT (not the same as SQL SELECT), PROJECT, and JOIN. Relational SELECT SELECT is used to obtain a subset of the tuples of a relation that satisfy a select condition. For example, find all employees born after 1st Jan 1950: SELECTdob '01/JAN/1950'(employee) Relational PROJECT The PROJECT operation is used to select a subset of the attributes of a relation by specifying the names of the required attributes. For example, to get a list of all employees surnames and employee numbers: PROJECTsurname,empno(employee) SELECT and PROJECT SELECT and PROJECT can be combined together. For example, to get a list of employee numbers for employees in department number 1:

Figure : Mapping select and project Set Operations - semantics Consider two relations R and S. UNION of R and S the union of two relations is a relation that includes all the tuples that are either in R or in S or in both R and S. Duplicate tuples are eliminated. INTERSECTION of R and S the intersection of R and S is a relation that includes all tuples that are both in R and S. DIFFERENCE of R and S the difference of R and S is the relation that contains all the tuples that are in R but that are not in S. SET Operations - requirements For set operations to function correctly the relations R and S must be union compatible. Two relations are union compatible if they have the same number of attributes the domain of each attribute in column order is the same in both R and S. UNION Example

Figure : UNION INTERSECTION Example

29

Figure : Intersection DIFFERENCE Example

Figure : DIFFERENCE CARTESIAN PRODUCT The Cartesian Product is also an operator which works on two sets. It is sometimes called the CROSS PRODUCT or CROSS JOIN. It combines the tuples of one relation with all the tuples of the other relation. CARTESIAN PRODUCT example

Figure : CARTESIAN PRODUCT JOIN Operator JOIN is used to combine related tuples from two relations: In its simplest form the JOIN operator is just the cross product of the two relations. As the join becomes more complex, tuples are removed within the cross product to make the result of the join more meaningful. JOIN allows you to evaluate a join condition between the attributes of the relations on which the join is undertaken. The notation used is R JOINjoin condition S JOIN Example

Figure : JOIN Natural Join Invariably the JOIN involves an equality test, and thus is often described as an equi-join. Such joins result in two attributes in the resulting relation having exactly the same value. A `natural join' will remove the duplicate attribute(s). In most systems a natural join will require that the attributes have the same name to identify the attribute(s) to be used in the join. This may require a renaming mechanism. If you do use natural joins make sure that the relations do not have two attributes with the same name by accident. OUTER JOINs Notice that much of the data is lost when applying a join to two relations. In some cases this lost data might hold useful information. An outer join retains the information that would have been lost from the tables, replacing missing data with nulls. There are three forms of the outer join, depending on which data is to be kept. LEFT OUTER JOIN - keep data from the left-hand table

30

RIGHT OUTER JOIN - keep data from the right-hand table FULL OUTER JOIN - keep data from both tables OUTER JOIN example 1

Figure : OUTER JOIN (left/right) OUTER JOIN example 2

Figure : OUTER JOIN (full) Embedded SQL Embedded SQL is a method of combining the computing power of a programming language and the database manipulation capabilities of SQL. Embedded SQL statements are SQL statements written inline with the program source code of the host language. The embedded SQL statements are parsed by an embedded SQL preprocessor and replaced by host-language calls to a code library. The output from the preprocessor is then compiled by the host compiler. This allows programmers to embed SQL statements in programs written in any number of languages such as: C/C++, COBOL and Fortran. The ANSI SQL standards committee defined the embedded SQL standard in two steps: a formalism called Module Language was defined, then the embedded SQL standard was derived from Module Language.[1] The SQL standard defines embedding of SQL as embedded SQL and the language in which SQL queries are embedded is referred to as the host language. A popular host language is C. The mixed C and embedded SQL is called Pro*C in Oracle and Sybase database management systems. In the PostgreSQL database management system this precompiler is called ECPG. Other embedded SQL precompilers are Pro*Ada, Pro*COBOL, Pro*FORTRAN, Pro*Pascal, and Pro*PL/I. Stored procedure A stored procedure is a subroutine available to applications that access a relational database system. A stored procedure (sometimes called a proc, sproc, StoPro, StoredProc, orSP) is actually stored in the database data dictionary. Typical uses for stored procedures include data validation (integrated into the database) or access control mechanisms. Furthermore, stored procedures can consolidate and centralize logic that was originally implemented in applications. Extensive or complex processing that requires execution of several SQL statements is moved into stored procedures, and all applications call the procedures. One can use nested stored procedures, by executing one stored procedure from within another. Stored procedures are similar to user-defined functions (UDFs). The major difference is that UDFs can be used like any other expression within SQL statements, whereas stored procedures must be invoked using the CALL statement. Stored procedures may return result sets, i.e. the results of a SELECT statement. Such result sets can be processed using cursors, by other stored procedures, by associating a result set locator, or by applications. Stored procedures may also contain declared variables for processing data and cursors that allow it to loop through multiple rows in a table. Stored procedure languages typically include IF, WHILE, LOOP, REPEAT, and CASE statements, and more. Stored procedures can receive variables, return results or modify variables and return them, depending on how and where the variable is declared. Database trigger A database trigger is procedural code that is automatically executed in response to certain events on a particular table or view in a database. The trigger is mostly used for keeping the integrity of the information on the database. For example, when a new record (representing a new worker) is added to the employees table, new records should be created also in the tables of the taxes, vacations, and salaries. The need and the usage Triggers are commonly used to: audit changes (e.g. keep a log of the users and roles involved in changes) enhance changes (e.g. ensure that every change to a record is time-stamped by the server's clock) enforce business rules (e.g. require that every invoice have at least one line item) execute business rules (e.g. notify a manager every time an employee's bank account number changes) replicate data (e.g. store a record of every change, to be shipped to another database later) enhance performance (e.g. update the account balance after every detail transaction, for faster queries)

31

The examples above are called Data Manipulation Language (DML) triggers because the triggers are defined as part of the Data Manipulation Language and are executed at the time the data is manipulated. Some systems also support non-data triggers, which fire in response to Data Definition Language (DDL) events such as creating tables, or runtime or and events such as logon, commit, and rollback. Such DDL triggers can be used for auditing purposes. The following are major features of database triggers and their effects: triggers do not accept parameters or arguments (but may store affected-data in temporary tables) triggers cannot perform commit or rollback operations because they are part of the triggering SQL statement (only through autonomous transactions) triggers are normally slow (slowdown the process) INSERT trigger When an INSERT trigger statement is executed, new rows are added to the trigger table and to the inserted table at the same time. The inserted table is a logical table that holds a copy of rows that have been inserted. The inserted table can be examined by the trigger, to determine whether or how the trigger actions are carried out. The inserted table allows comparing the INSERTED rows in the table to the rows in the inserted table. The inserted table is always duplicates of one or more rows in the trigger table. With the inserted table, inserted data can be referenced without having to store the information to the variables. DELETE trigger When a DELETE trigger statement is executed, rows are deleted from the table and are placed in a special table called deleted table. UPDATE trigger When an UPDATE statement is executed on a table that has an UPDATE trigger, the original rows are moved into deleted table, while the update row is inserted into inserted table and the table is being updated. Syntax Code: SQL IF UPDATE<column_name> CREATE TRIGGER trigger1 ON member FOR UPDATE AS IF UPDATE (last_name) BEGIN RAISEERROR('Transaction cannot be processed') ROLLBACK TRANSACTION END Multi-row trigger A multi-row insert can occur from an INSERT with a SELECT statement.Multirow considerations can also apply to multi-row updates and multi-row deletes. Code: SQL CREATE TRIGGER adult_insert ON adult FOR INSERT AS DECLARE @rcnt int SELECT @rcnt = @@rowcount IF(SELECT COUNT(*) FROM member,inserted WHERE member.mem_num = inserted.mem_num)=0 BEGIN PRINT 'Transaction not processed' PRINT 'No entry for this member' ROLLBACK TRANSACTION END IF(SELECT COUNT(*) FROM member,inserted WHERE member.mem_num = inserted.mem_num)<> @rcnt

BEGIN PRINT 'Not all adults have an entry in the member table' PRINT 'Multi-row insert transaction has been rolled backed' ROLLBACK TRANSACTION END View A VIEW is a virtual table, through which a selective portion of the data from one or more tables can be seen. Views do not contain data of their own. They are used to restrict access to the database or to hide data complexity. A view is stored as a SELECT statement in the database. DML operations on a view like INSERT, UPDATE, DELETE affects the data in the original table upon which the view is based. The Syntax to create a sql view is CREATE VIEW view_name AS SELECT column_list FROM table_name [WHERE condition];

32