Database Systems: Traditional File System

DATABASE SYSTEMS
Information has been acknowledged as one of the most important resources in today's complex business environment.
To survive in this competitive world, organizations need to take effective decisions. To be effective, the decision-making process, which relies mainly on internal information processing, needs to be in place and able to provide quality information. Effective decision-making therefore requires the availability of the right information at the right time, in the right quantity, to the right person. This essentially requires gathering and storing data on various aspects of the organization's working in one place, say on a computer system, in a form that aids fast and easy access to the data and its conversion into information. Besides easy access, a computerized system should also be able to manage data efficiently. The purpose of storing data is served in two possible ways: using application files or using databases.
The two methods have their own advantages, but given today's fast-changing information requirements, the database offers a more flexible and faster way of storing and retrieving data.

TRADITIONAL FILE SYSTEM
A file system is a component of an operating system that is responsible for managing the space on storage media such as disks, tapes, CD-ROM, CD-R, etc. Common services provided by a file system are: allocation of storage requested by programs, providing access to data, and providing an overview of the available data and mounted devices.
A file is also defined as a collection of related records. A file records data pertaining to a specific activity such as payroll, provident fund, income tax, etc. There are various classifications of files, but generally they are classified into the following two types: Program Files and Data Files.

Program Files
Program files are the files that contain the instructions for performing a task on the computer. A collection of program files is what is generally referred to as software. The files may be of various types such as .EXE files, .OVL files, .DLL files, .HLP files, etc., each of which has a specific purpose in the software.

Data Files
Data files are the files that contain data required by programs for processing. The data files may be of different types depending on the software with which they are processed. For example: .DOC files for word processing, .XLS files for spreadsheet software, .MDB for database, .DAT for general data files, etc.
There are various types of data files. On the basis of the changes that take place in the data, they are classified into Master Files and Transaction Files.

Master Files
The master file contains or stores relatively permanent data. For example, Employee No., Employee Name, Basic Pay, Customer Address, Product Code, Bank Account No., etc. This data is updated only when there is a change in such almost static data. For example, the basic pay needs to be changed only for those employees who are due for an increment in a specific month, and the address needs to be updated only for those customers who have shifted their office.

Transaction Files
A transaction file holds or stores data temporarily and is used to update the master file. The data may pertain to creating new employee records, deleting records of employees who have left, or modifying existing data. It records the day-to-day transactions or activities in the organization.
The above files are stored on magnetic storage devices so that the data can be retrieved and processed using various techniques.
File Organization and Access Methods
File organization refers to the way the data is organized on the magnetic media and how it is retrieved. The organization method determines the speed of access to any piece of data. The various methods of file organization are: Sequential File, Direct or Random File, and Indexed Sequential File.

Sequential File Organization
A sequential file is a file with a record structure in which the records are stored one after the other on a storage device and can only be accessed in that sequence. It is also sometimes referred to as serial organization, as it stores records in sequential order of some specific attribute. For example, storing student records in the order of roll numbers, employee records in the order of employee code, etc. This facilitates access, as the majority of the time the data is processed in the same sequence. This is similar to an audio cassette recording: to listen to the third song, you have to skip or fast-forward past the first two songs. The popular storage devices used for this organization are magnetic tape or cartridge tape. Other devices could be floppy disk or magnetic disk. This organization method is useful and less expensive for processing a large number of records, for example payroll processing, generating accounts receivable, etc. The major disadvantage of this method is that locating or accessing a record is time-consuming, i.e. to reach the record of the student with roll no. 55, one has to wait until the first 54 records are skipped. The bigger the file, the more time it takes to access a record.
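The cost of sequential access described above can be illustrated with a minimal Python sketch. The record layout (roll number, name), the function name sequential_find and the 100-student file are assumptions made for illustration only; an in-memory list stands in for a tape.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical record layout: (roll number, student name).
Record = Tuple[int, str]


def sequential_find(records: Iterable[Record], roll_no: int) -> Tuple[Optional[Record], int]:
    """Scan records in stored order; return the match and the number of records read."""
    reads = 0
    for record in records:
        reads += 1
        if record[0] == roll_no:
            return record, reads
    return None, reads


# Illustrative file of 100 students stored in roll-number order.
student_file = [(n, f"Student {n}") for n in range(1, 101)]

record, reads = sequential_find(student_file, 55)
print(record, reads)  # the 54 earlier records are read before roll no. 55 is reached
```

The number of reads grows with the position of the record in the file, which is why a larger file means a longer average access time.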
Direct or Random File Organization
A direct access file has a symbolic name (key) for each record, from which the position of the record in the file is calculated. This method of organization does not follow any sequence for storing the records. Instead, records are retrieved in a random manner on the basis of their key value, i.e. the data field that uniquely identifies a record. This method of storage and access is useful for applications where records need to be retrieved one by one within a reasonably short time, for example the online updating of bookings in a railway reservation system. Any record can be retrieved almost instantaneously by typing the key value, for example train number, PNR number, item code, customer code, etc. This method is similar to accessing songs stored on a gramophone record: placing the needle on a track immediately plays that song. The storage devices that provide such access are magnetic disk, floppy disk, CD, etc.
Since there is no fixed way of organizing the data, access to a record depends on linking the key field data with the physical location of the record on the storage device. One popular method used for this is hashing. A hashing algorithm is an ingenious and useful form of address calculation. The record key is converted into a near-random number (by using a mathematical formula), and this number gives the address of the record on the storage device. Sometimes, instead of the record address, it gives the address of a group of records (called a bucket or pocket); the number of logical records in that group is called the bucket capacity.
This method is much faster than sequential organization for accessing a specific record but is more expensive due to the need for random-access storage devices. It is therefore not cost-effective for processing large volumes of data such as payroll.

Indexed Sequential File Organization
This file organization method combines the features of the sequential and random file methods. It permits a file to be used in sequential access mode as well as random access mode. The records are stored in sequential order of the key element, as in sequential files. An indexed sequential file is a file with a record structure that can be accessed at random via an index. The index is often organized as a direct access file. The index entries point to a block or a track on disk. The block or track pointed to is then scanned sequentially for the requested record.
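The two lookup ideas described above, hashing a key to a bucket and using an index to pick a block that is then scanned, can be sketched in Python as follows. This is only an illustration under stated assumptions: the division-remainder formula (key modulo number of buckets) is one simple hashing formula chosen here, not one named in the text, and the bucket count, item records and block contents are all hypothetical.

```python
from bisect import bisect_right

BUCKET_COUNT = 7  # hypothetical number of buckets on the storage device


def bucket_address(key: int) -> int:
    """Division-remainder hashing (an assumed formula): map a record key to a bucket number."""
    return key % BUCKET_COUNT


# Direct organization: each bucket holds the records whose keys hash to it.
buckets = {b: [] for b in range(BUCKET_COUNT)}
for item_code, item_name in [(1001, "Pump"), (1008, "Valve"), (2045, "Gasket")]:
    buckets[bucket_address(item_code)].append((item_code, item_name))


def direct_find(key: int):
    """Compute the bucket address from the key, then search only that bucket."""
    for record in buckets[bucket_address(key)]:
        if record[0] == key:
            return record
    return None


# Indexed sequential organization: records are stored in key order, grouped into
# blocks, and the index holds the lowest key of each block.
blocks = [
    [(10, "A"), (20, "B"), (30, "C")],
    [(40, "D"), (50, "E"), (60, "F")],
    [(70, "G"), (80, "H"), (90, "I")],
]
index = [block[0][0] for block in blocks]


def indexed_find(key: int):
    """Use the index to choose one block, then scan that block sequentially."""
    position = bisect_right(index, key) - 1
    if position < 0:
        return None  # key is smaller than every key in the file
    for record in blocks[position]:
        if record[0] == key:
            return record
    return None


print(direct_find(1008))
print(indexed_find(50))
```

In this sketch the keys 1001 and 1008 fall into the same bucket, which is the situation the bucket capacity is meant to accommodate.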
In addition, an index is created that records the key values and their corresponding physical addresses on the storage device. This requires the use of random-access storage devices such as magnetic disks or CDs. For example, the student file needs to be accessed randomly when updating a change of address, but sequential access is desirable for generating a list of students who have not paid the fees in a particular semester. In a stock master file, updating the quantity issued for specific items requires random access to item records by their codes, but sequential access would be economical for finding, say, the items having stock value > 1000. This method is quite useful as it provides dual access to the file, but it is more expensive than sequential access due to the extra storage for the index, and slower than the random access method due to the additional index search.
The file management system is the software that helps in creating, retrieving and manipulating files on a magnetic storage device. The traditional file management system has certain drawbacks that have affected its utility for developing applications.

Drawbacks of the Traditional File Management System
In the traditional file environment, the operational data of the organization is widely dispersed across separate files. This imposes several limitations that in turn restrict its utility. The major drawbacks are listed below:
Data Redundancy
Data redundancy refers to the duplication of the same data fields in many different application files. This results in repetition of data in multiple files in the organization. For example, employee code and name will appear in the Payroll file, Provident Fund file, Personal file, Income Tax file, etc. Because the same information is held in several places, any change requires updating at all of those places. The duplication of data results in a variety of problems, such as wastage of storage capacity and loss of data integrity.

Data Integrity Problems
The data integrity problem is a result of data redundancy. Since data is available in multiple files, failure to update any one of the files will result in the generation of erroneous information. This makes the data inaccurate, inconsistent and obsolete.
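How redundancy leads to an integrity problem can be shown with a small Python sketch. The two dictionaries stand in for a payroll file and a provident fund file; the employee code, names, amounts and the helper inconsistent_names are hypothetical and exist only for this illustration.

```python
# Two application files that each keep their own copy of the employee name.
payroll_file = {"E101": {"name": "S K Gupta", "basic_pay": 6000}}
provident_fund_file = {"E101": {"name": "S K Gupta", "pf_balance": 45000}}


def inconsistent_names(file_a: dict, file_b: dict) -> list:
    """Return the employee codes whose name differs between the two files."""
    shared_codes = file_a.keys() & file_b.keys()
    return sorted(
        code for code in shared_codes
        if file_a[code]["name"] != file_b[code]["name"]
    )


before_update = inconsistent_names(payroll_file, provident_fund_file)

# The name is changed in the payroll file only; the provident fund copy is missed.
payroll_file["E101"]["name"] = "S K Gupta Verma"

after_update = inconsistent_names(payroll_file, provident_fund_file)
print(before_update, after_update)
```

Once one copy is updated and the other is not, the two files disagree about the same employee, and there is no way to tell from the data alone which copy is current.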
Limited Data Sharing
The data stored in various application files are related to each other since they all pertain to the same organization. The exchange of such data is difficult in the file-based system since the applications are located in specific sections such as Personnel, Accounts, etc.

Data Availability Constraints
For taking the right decision, information must be available at the right time. Since data is scattered across various files in various sections, the availability of information is constrained.

Lack of Program Independence
Different programmers create the files in the file management system using different file formats. The application programs make use of these formats in manipulating the data files. This ties the programs to the files and makes the files program dependent. In case of any change in the data storage device or format, the programs need to be rewritten, which costs time and money.
As the use of computers in our lives grows, the above drawbacks of the file system have become a major obstacle to its use. This led to the exploration of new possibilities for storing and managing data in an efficient and effective manner.

Database System
In today's world of globalization, success in business requires fast access to both internal and external information. Managing data involves storing, organizing, adding, modifying, and deleting data to and from the database.
The term database refers to the stored operational data of an enterprise, and it forms an integral part of a data processing system. A database is a collection of several related files. It forms the basic building block of an organizational information system. For example, an Employee Database may contain all the files on various aspects of employees, such as the payroll file, employee file, retirement file, PF file, etc. This overcomes the limitations of the file system listed above and provides the necessary flexibility to both the user and the organization, resulting in savings in effort and cost.
The major difference between a database and a data file is that a data file may have more than one use but can satisfy only one view of the stored data, while a database may have more than one use and can satisfy multiple views of the stored data. The multiple views arise because the same piece of data has multiple users. For example, in a banking system, the information about account holders has several users, such as savings, loans, fixed deposits, etc.
Database Management System (DBMS)
A Database Management System (DBMS) is computer software that provides easy and quick access to databases and allows their manipulation by inserting, deleting, modifying or querying the data. A DBMS therefore stores, processes, and retrieves data from the database. The DBMS also offers a number of services. It defines a method of storing the data and provides services that allow you to retrieve and manipulate that data. It gives a number of users simultaneous access to a large amount of information stored in the database. The DBMS also ensures that the data stored in the database is accurate and secure.
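The four basic operations named above (inserting, modifying, deleting and querying) can be tried with a small sketch that uses SQLite through Python's standard sqlite3 module. SQLite is used here only as a convenient stand-in for a DBMS; the table name, columns and values are assumptions for illustration.

```python
import sqlite3

# An in-memory database, so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (eno INTEGER PRIMARY KEY, ename TEXT, basic_pay INTEGER)"
)

# Inserting records.
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(100, "Jack", 6000), (101, "Kevin", 3000), (102, "Sam", 4000)],
)

# Modifying a record.
conn.execute("UPDATE employee SET basic_pay = 6500 WHERE eno = 100")

# Deleting a record.
conn.execute("DELETE FROM employee WHERE eno = 102")

# Querying the data that remains.
rows = conn.execute(
    "SELECT eno, ename, basic_pay FROM employee ORDER BY eno"
).fetchall()
print(rows)

conn.close()
```

The program states what it wants done to the data; how and where the rows are physically stored is left to the DBMS.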
Figure: Functionality of Database Management System
A database management system integrates the data files into a database and can provide different views of the data to different users depending on their requirements. A DBMS therefore makes it possible to access integrated data across multiple operations, functions and organizational boundaries.

DATABASE CONCEPTS/TERMINOLOGY
The most commonly used terms in database systems are discussed below:

Database
Any organization must store information about its suppliers, customers, employees, sales orders, etc. A database is an organized collection of related information stored at a central location. For example, a company maintains a database of its personnel, a college maintains a database of its students, a hospital maintains a database of its patients' records, etc. A database is simply a collection of data. To manage efficiently a database that stores a large amount of important information, an appropriate database system is needed. A database system, thus, is a computerized record-keeping system, the purpose of which is to maintain data and to make that data available on demand.
Figure: DBMS - A simplified user's view
[Figure content: users; application programs (Inventory, Personnel, Payroll, Sales & Marketing); query language; integrated database.]

Entities
An entity is any item about which we collect and store information in the database. An entity may be a tangible object such as a person, an employee, a place, etc. It may also be an intangible object such as an event, concept, condition, etc. In a data processing system, we are generally concerned with collections of similar entities such as employees, customers, students, etc. For example, in a university environment, the entities about which data is stored are STUDENTS, FACULTY, COURSES, EXAMINATIONS, etc. In a hospital environment, the entities are PATIENTS, DOCTORS, NURSES, ROOMS, DRUGS, etc. In a manufacturing system, the entities are SUPPLIERS, PARTS, CUSTOMERS, ORDERS, SHIPMENTS, etc. For any database to be developed, the selection of entities in the organization is the first step. This depends on the kind of problem to be handled, its relationship with other activities, etc.

Attributes
An attribute is a data field that describes the entity. Every entity therefore has some basic attributes that characterize it. Attributes are the items on which we record data about the entity. For example, a student may be described by attributes such as Registration Number, Name, Address, Telephone, Date of Birth, Class, Course, etc. The attributes we select may store any type of data, such as text, numbers, graphics, audio, video, etc.

Data Value
A data value is the actual data or information contained in each data field. It is called a data value because it records the specific details or facts of an individual entity. The data field employee name can take values like S K Gupta, V Kumar, Jyoti, etc. These values could be quantitative, qualitative or descriptive, depending on how the data fields describe the entity. For example:
9811034 PRANAV 77 MODEL TOWN, DELHI 7278593 20/09/84 BCA are the data values that identify a student.
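The relationship between an entity, its attributes and its data values can be made concrete with a short Python sketch. The class Student and its field names are illustrative, and pairing each value in the example with a particular attribute (for instance, reading 9811034 as the registration number and 7278593 as the telephone number) is an assumption, since the text lists the values without labels.

```python
from dataclasses import dataclass, fields


@dataclass
class Student:
    """The entity type; each field is one attribute (field names are assumed)."""
    registration_number: str
    name: str
    address: str
    telephone: str
    date_of_birth: str
    course: str


# One entity: the data values from the example, matched to attributes by assumption.
pranav = Student(
    registration_number="9811034",
    name="PRANAV",
    address="77 MODEL TOWN, DELHI",
    telephone="7278593",
    date_of_birth="20/09/84",
    course="BCA",
)

attribute_names = [f.name for f in fields(Student)]
for attribute in attribute_names:
    print(f"{attribute}: {getattr(pranav, attribute)}")
```

The class describes the attributes once; every student record then supplies its own data value for each attribute.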
Key Data Field
From among the various data fields that describe an entity, the value of some data fields can uniquely identify the values contained in the other data fields of the same entity. For example, knowing the employee code, we can find the employee's personal data such as name, date of birth, date of joining, scale of pay, qualifications, etc. The data fields that help in identifying other data fields are called Key Data Fields. A key field contains unique data used to identify a record so that it can be easily retrieved and processed. Some examples of key fields are: Customer Code, Vendor Identification, Student Registration No., PAN No., Customer Account No., etc. Sometimes there is more than one key field in an entity. These data fields are candidates for becoming the key data field and are therefore referred to as candidate keys. The selection of the primary or key data field is very important as it may directly affect the database design process.

OBJECTIVES OF DATA BASE MANAGEMENT SYSTEMS
The decision of an organization to store all its operational data in an integrated database is based on the primary advantage offered by a database, i.e. centralized control of operational data. Therefore the major objective of a database management system is:
Integrating Databases to provide Centralized Control over data
This is required since the data needed for problem solving may reside in different databases. This demands that the databases be integrated. This integration provides centralized control of organizational data.

ADVANTAGES AND DISADVANTAGES OF DBMS
This integration or centralized control of data results in several advantages as well as disadvantages for an organization. These are:
The amount of redundancy in the stored data can be minimized. In a database, data fields are stored only once instead of being repeated. A single data field can be shared by various applications or users. Duplication of data cannot be eliminated completely, and it sometimes helps to improve performance, but it is reduced considerably. This makes the database less expensive, as the excessive storage needs caused by duplication are avoided.
Problem of inconsistency in the stored data can be avoided. This advantage follows from the reduced redundancy of data. Since data is stored in one place, any update needs to be made only once. This results in improved consistency of data.
The sharing of stored data is possible. The database allows sharing of information among many users or applications. This means that data may be stored once and can then be retrieved any number of times, for specified purposes, by authorized users of the database. This helps in reducing storage requirements and improving consistency.
Higher Program Independence. Maintenance of the database is easy and simple, as programs are tied to the database view instead of file formats. The database view is independent of the physical storage media.
Security Restrictions can be Applied. Although several users can share data, access to specific pieces of information can be restricted to selected authorized users. The DBA can ensure that access to confidential or sensitive data in the database is allowed only to legitimate users after proper authorization checks such as passwords. Access could be permitted for retrieve, update, or delete operations, or a combination of these.
Improved Data Integrity. This requires that the database contain accurate data. The accuracy and consistency of data are ensured by reduced redundancy and the absence of inconsistency. To further improve the accuracy of data, validation procedures may be used while updating the data. A database that is secure and reliable is said to have data integrity.
Improved User Productivity. DBMSs are quite simple to use and provide flexible ways of generating information from the database by using English-like queries.
Standards can be Enforced. With centralized control of data, industry standards can be adopted throughout the organization in representing the data. For example, instead of using different units of measurement such as kilograms, grams and quintals, we may use kilograms as the standard unit.
Although the database system offers clear advantages, there are certain disadvantages associated with it. The major ones are:
Enterprise Vulnerability. Centralizing all the data of an enterprise in one database may mean that the database becomes an indispensable resource. The survival of the enterprise may depend on reliable information being available from its database. The enterprise therefore becomes vulnerable to the destruction of the database or to unauthorized modification of the database.
Data Quality. Since the database is accessible to users remotely, adequate controls are needed over users updating data and over data quality. With an increased number of users accessing data directly, there are many opportunities for users to damage the data. Unless there are suitable controls, data quality may be compromised.
Cost. Using a database involves high costs: expensive database management software, large random-access disk storage, more memory, and more skilled manpower for design, development and maintenance.
Data Integrity. Since a large number of users could be using a database concurrently, technical safeguards are necessary to ensure that the data remain correct during operation. The main threat to data integrity comes from several different users attempting to update the same data at the same time. The database therefore needs to be protected against inadvertent changes by the users.
Confidentiality and Security. When information is centralized and made available to users from remote locations, the possibilities of abuse are often greater than in a conventional system. To reduce the chances of unauthorized users accessing sensitive information, it is necessary to take technical, administrative and, possibly, legal measures.
Privacy. This relates to the ethical use of the database for specified purposes. Such use should not intrude into people's privacy.
DATABASE ADMINISTRATOR (DBA)
An organization with an integrated database will have some identifiable person who has central responsibility for the operational data. The Database Administrator (DBA) is a person or group of persons who coordinates and manages all activities and procedures related to the organizational database. The major responsibilities of the DBA include:

Database Planning
This involves understanding the information requirements of the organization and the users, selecting the DBMS, and specifying security mechanisms. In effect, this means deciding what information should be stored in the database and the type of data model to be used for representing data in the database.

Database Design
This includes three functions: defining the conceptual schema, defining the storage structure, and defining the mappings between the two.

Database Creation
This involves converting the design framework into an actual database by entering the data and storing it on magnetic devices.

Database Implementation and Maintenance
This is concerned with meeting changing user needs by modifying the database. It involves changing the structure and organization of the database to meet the changed requirements. The actual deletion or addition of records is the job of the users.

Liaison with the Users
This is to ensure that users' needs are being met, and includes monitoring how and what is being used, determining user access privileges, etc.

Ensuring Database Security
The DBA is responsible for ensuring the integrity (security and reliability) of data by preventing unauthorized access to the database. To achieve this, authorization checks and validation procedures need to be defined using the DDL.

Backup and Recovery
Once an organization creates and starts using the database, its operations become dependent on the smooth operation of the system. In the event of any damage to the database (hardware or software errors, fire, etc.), it is essential to be able to recover or rebuild the data with minimum delay. This requires that the DBA develop appropriate backup and recovery procedures, such as periodic dumping of the database to a backup storage device.

Performance Monitoring
This ensures that the system is serving the users in the best possible way. A general complaint that the system is too slow needs to be addressed. The DBA should respond to changes in physical storage by modifying the definitions.

Defining Concurrency Procedures
This refers to the sharing of the database by multiple users. If a piece of data is shared by multiple users, chaos may result, as two or more users may try to change the same piece of data at the same time. It is therefore essential to have appropriate concurrency procedures defined to avoid the problems arising from concurrent access to data.
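The problem that concurrency procedures guard against can be sketched in Python. The first half replays, step by step, two users who both read a stock quantity before either writes it back, so one update is lost. The second half applies a lock around each read-modify-write; locking is used here as one common concurrency procedure and is an assumption of this sketch, not a method named in the text. The stock figures and the function issue are hypothetical.

```python
import threading

stock = {"quantity": 100}

# Without a concurrency procedure: both users read before either one writes.
read_by_user_a = stock["quantity"]
read_by_user_b = stock["quantity"]
stock["quantity"] = read_by_user_a - 10  # user A issues 10 units
stock["quantity"] = read_by_user_b - 20  # user B issues 20 units and overwrites A's update
lost_update_result = stock["quantity"]   # only B's issue is reflected

# With a simple locking procedure: each read-modify-write runs as one unit.
stock["quantity"] = 100
lock = threading.Lock()


def issue(units: int) -> None:
    """Issue stock while holding the lock, so no other update can interleave."""
    with lock:
        current = stock["quantity"]
        stock["quantity"] = current - units


threads = [threading.Thread(target=issue, args=(units,)) for units in (10, 20)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
locked_result = stock["quantity"]

print(lost_update_result, locked_result)
```

In the first case 30 units were issued but the record shows only 20 removed; with the lock, both issues are reflected regardless of which user runs first.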
TYPES OF DATABASE MANAGEMENT SYSTEMS
Database systems can be organized into four basic categories: Hierarchical, Network, Relational, and Object Oriented.
These databases have evolved over a period of time as shown in Figure 8.3. The currently used databases are the relational and object-oriented databases.

Hierarchical Databases
In the hierarchical database organization, the data is represented by a simple tree structure. This tree structure has various levels, with the lower level records (called Child) being subordinate to the higher level records (called Parent). The parent record at the top of the tree (database) is usually known as the root record or simply the root. In general, the root may have any number of dependent record types (children), each of which may have any number of lower level record types, and so on to any number of levels. This means that in a hierarchical database, a parent record may have more than one child but a child can have only one parent. This represents a one-to-many relationship among the data records, similar to a family tree. In order to locate any specific record, we have to start from the top or parent record and trace down the tree to the child. For example, employees working in an organization are associated with specific departments as shown in Figure 8.4. There are various departments and each department has many employees. One employee can belong to only one department. This shows the one-to-many relationship between departments and employees, and the absence of any relationship among the employees (the children).
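The top-down search described above can be sketched with a small tree in Python. The Node class and find_path function are illustrative names, and which employee belongs to which department is an assumption made only so that the tree has some content.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """A record in the hierarchy: one parent (implicit), any number of children."""
    name: str
    children: List["Node"] = field(default_factory=list)

    def add_child(self, child: "Node") -> "Node":
        self.children.append(child)
        return child


def find_path(node: Node, target: str, path: Tuple[str, ...] = ()) -> Optional[List[str]]:
    """Search from the root downward; return the chain of records leading to target."""
    path = path + (node.name,)
    if node.name == target:
        return list(path)
    for child in node.children:
        found = find_path(child, target, path)
        if found is not None:
            return found
    return None


# Assumed assignment of employees to departments, for illustration only.
root = Node("Departments")
personnel = root.add_child(Node("Personnel"))
accounts = root.add_child(Node("Accounts"))
personnel.add_child(Node("Suresh"))
personnel.add_child(Node("Rakesh"))
accounts.add_child(Node("Neeraj"))

print(find_path(root, "Neeraj"))
```

Every lookup starts at the root, and each employee is reachable through exactly one department, which is the one-to-many restriction of the hierarchical model.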
Figure: Evolution of databases
[Figure content: the level labelled Departments holds Personnel, Accounts, Production and Sales; the level labelled Employee Name holds Suresh, Rakesh, Kapil, Neeraj, Surender, Neelam, Kavita, Anil and Ram.]
Figure: Hierarchical database - One-to-many relationship
The hierarchical organization is the oldest type but is still used in certain systems, such as reservation systems, because of some of its strengths. Its major strengths are simplicity of implementation and fast updating of data, as the relationship between the parent and the child is pre-defined. However, it has many drawbacks as well. The structure is rigid: adding a new data field to the database requires that the entire database be redefined. Also, we cannot insert any data in the database unless there is a parent for it. For example, a supplier record cannot be added to the supplier database unless the supplier supplies some parts.

Network Databases
In this type of database organization, records and links represent the data, as in the hierarchical database. However, this is a more generalized structure because a record may have any number of superiors or owners, as against one in the hierarchical model. In terms of family relationships, each child record can have more than one parent record. The child record is called a member and may be reached through more than one parent, called an owner. This reflects a many-to-many relationship, which can easily be mapped using the network model. For example, in a university system, the relationship between students, courses and departments can be shown as in Figure 8.5. Student A has two owners: Financial Management and Sales Management. The owner Commerce department has two members: General Management and Financial Management.
[Figure content: three levels labelled Departments (MANAGEMENT, COMMERCE), Courses (SALES MGT., FIN. MGT, GEN. MGT.) and Students (A, B, C, D, F).]
Figure: Network database - many-to-many relationship
This model offers more flexibility, as new relationships may be established among data records at different levels. It can easily represent hierarchical (one-to-many) relationships as well. The primary disadvantage of this database organization is the complexity of its structure. The structure also needs to be defined in advance. In addition, there is a restriction on the number of possible links that data records can have.

Relational Databases
The relational database organization connects data in different files through the use of a common data field called a key field. In this arrangement, the data fields are stored in different tables consisting of rows and columns. The tables are called relations; the rows of a table are referred to as tuples or records, and the columns are referred to as attributes. A relation is only a table of records and not a linkage between records.
Example 1: The data on Suppliers, Parts and Shipments is held in separate tables with the following columns.
Suppliers Table: Supplier No., Supp. Name, City, Status
Parts Table: Part No., Part Name, Location where stored, Unit of Measure
Shipment Table: Supplier No., Part No., Quantity Supplied
Each of these tables resembles a sequential file, with rows representing records and columns representing data fields. The records are identified by key fields such as Supplier No., Part No., or the combination of Supplier No. and Part No.
Example 2: To store information about employees, different tables such as an employee table, a department table, a salary table, etc. can be created.
DEPARTMENT
DNO   DNAME
10    Sales
20    Computer
30    Accounts
40    Production

EMPLOYEE
ENO   ENAME   JOB        DNO
100   Jack    Manager    20
101   Kevin   Operator   10
102   Sam     Salesman   20

SALARY
ENO   BASIC   COMM
100   6000    1000
101   300
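How a key field connects two relations can be seen by loading the DEPARTMENT and EMPLOYEE tables of Example 2 into SQLite through Python's standard sqlite3 module and joining them on the department number. SQLite and the lowercase table and column names are choices made for this sketch; the SALARY table is left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE department (dno INTEGER PRIMARY KEY, dname TEXT);
    CREATE TABLE employee (
        eno   INTEGER PRIMARY KEY,
        ename TEXT,
        job   TEXT,
        dno   INTEGER REFERENCES department (dno)
    );
    INSERT INTO department VALUES
        (10, 'Sales'), (20, 'Computer'), (30, 'Accounts'), (40, 'Production');
    INSERT INTO employee VALUES
        (100, 'Jack', 'Manager', 20),
        (101, 'Kevin', 'Operator', 10),
        (102, 'Sam', 'Salesman', 20);
    """
)

# The common field dno links each employee row to its department row.
rows = conn.execute(
    """
    SELECT e.ename, e.job, d.dname
    FROM employee AS e
    JOIN department AS d ON e.dno = d.dno
    ORDER BY e.eno
    """
).fetchall()
print(rows)

conn.close()
```

Neither table stores a pointer to the other; the relationship exists only through the matching values of the key field.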
Object Oriented Databases
An object-oriented database approach makes use of objects as elements within database files. An object consists of data, which may be text, graphics, audio or video, together with the instructions or methods to perform actions on that data. This approach tries to model real-world situations. Unlike hierarchical, network or relational databases, which store text and numeric data, the object-oriented database can also store audio, video and graphical data. This type is object-oriented and is closer to real-world features, while the other approaches are record-oriented and closer to the computer system.

RELATIONAL DATABASE MANAGEMENT SYSTEM (RDBMS)
The Relational Database Management System (RDBMS) is based on the relational model that was proposed by Dr. E. F. Codd in the year 1970. In this model, data is stored in two-dimensional tables called relations. Some examples of RDBMS are Oracle 7.x, Sybase, Ingres, Microsoft Access and Unify. In 1985, Dr. Codd laid out 12 rules that must be followed by any DBMS to be considered relational. Oracle follows the largest number of the Codd rules, i.e. 11 rules. Other DBMSs such as MS Access cover approximately 7 rules and Sybase approximately 9. As of now, there is no RDBMS that fully implements all 12 rules of Dr. Codd.
The RDBMS is useful as it avoids loss of information during addition or deletion of records in the database. The user need not be aware of the structure of the data, and the system can therefore be used easily. It is highly flexible, as records and fields can easily be added, deleted or updated. The relational model is the most popular database model on microcomputers, in products such as Microsoft Access, Oracle, Paradox, FoxPro, etc.

ELEMENTS OF A DATABASE MANAGEMENT SYSTEM
A number of components make up a database management system. The prominent ones are discussed below:

Data Dictionary
This is an important DBA tool. It is a database in its own right. It contains metadata, i.e. data about data. It stores the description of other entities in the system rather than simply raw data. The dictionary file contains various schemas, mapping definitions, authorization checks, validation rules, etc.

Utilities
The DBMS has programs that help in the administration and maintenance of the database. These may include programs for creating a database, editing and deleting data in the database, and entering and displaying data in a proper manner using friendly screens.
Query Language
Databases can contain thousands of fields and millions of records. However, we are usually interested in only a part, or subset, of this data. Queries allow us to create a new table that contains only those fields and records in which we are interested. For example, if you have two tables, one with customer name and address information and another with charge account information, you can create a query to extract from them just the information you want. If you want to see late payers, you can create a new table that lists just the name, phone, and the date and amount of the charge. This new table, called a dynaset, contains only some of the fields from each table and only those records where payment is past due. Queries can also be used to update or delete groups of records. The dynaset can be used as the basis for a report; most reports begin with a query that gathers just the data the report is to list.

There are essentially two ways to create queries:
Query-By-Example (QBE)
Query Language
In query-by-example (QBE), users ask for information by using a sample record to specify the criteria for selecting records. There is a fill-in form in which you first select the fields you want to include and then specify which records are to be listed. To narrow the list, we can use selection criteria that specify a field to look in and a value to look for. The value can be text, a number or a date. After the fields to be included and the criteria to be used have been specified, the program creates the new table, i.e. the dynaset. This new table contains only those records that match the criteria specified. For example, to select the employees who joined the organization on or after 1 January 1998, we would enter >= 01/01/1998 in the date of joining field. When the query is executed, only those records that match the specified criteria are listed in the dynaset.
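The date-of-joining criterion above can be sketched in code. This is a minimal illustration using Python's standard sqlite3 module, with the criterion written as an SQL condition. The table name employee, the column names and the joining dates are assumptions invented for the example. Dates are stored as ISO strings (YYYY-MM-DD) so that text comparison matches date order.

```python
import sqlite3

# Hypothetical table and rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (eno INTEGER, ename TEXT, date_of_joining TEXT)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [
        (100, "Jack", "1996-03-15"),
        (101, "Kevin", "1998-01-01"),
        (102, "Sam", "2001-07-20"),
    ],
)

# The QBE criterion ">= 01/01/1998" in the date of joining field,
# written as a selection condition. Only two fields are kept.
dynaset = conn.execute(
    "SELECT ename, date_of_joining FROM employee "
    "WHERE date_of_joining >= ? ORDER BY eno",
    ("1998-01-01",),
).fetchall()

print(dynaset)
```

The result plays the role of the dynaset: a subset of the fields and only the records that satisfy the criterion.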
Queries can also be written out in a query language, much like a program. The most popular query language is Structured Query Language (SQL). It is easy to use and allows selection and retrieval of data through simple English-like statements, while still providing a detailed way to specify queries. Each query language has its own vocabulary and procedures. A query language typically has two parts: a Data Definition Language (DDL) and a Data Manipulation Language (DML). DDL describes the structure (content and format) of the data stored in the database. This definition is also called the schema definition and provides a link between the physical and logical views of the database. DML provides users with procedures to insert, delete or update data in the database using efficient access methods. SQL makes use of verbs such as SELECT, INSERT, UPDATE and DELETE.

Forms - Database tables are not very interesting to view, as they look much like a spreadsheet. To make a database more user-friendly, forms are created and displayed on the screen. People type data into these forms and it is automatically entered into the database. Once the data has been entered, the form can also be used to view, edit or delete it. Different forms can show different views of the same data: a payroll accountant looking at a payroll database might see the salaries of every employee, while a tax accountant might see just the withholding information.
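The split between DDL and DML described under the query language above can be sketched as follows. This is a minimal example using Python's standard sqlite3 module. The department table reuses the sample department names; the replacement name Finance and the particular statements chosen are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the structure (schema) of a table.
conn.execute(
    "CREATE TABLE department (dno INTEGER PRIMARY KEY, dname TEXT NOT NULL)"
)

# DML: insert, update and delete rows, then read them back.
conn.executemany(
    "INSERT INTO department (dno, dname) VALUES (?, ?)",
    [(10, "Sales"), (20, "Computer"), (30, "Accounts"), (40, "Production")],
)
conn.execute("UPDATE department SET dname = ? WHERE dno = ?", ("Finance", 30))
conn.execute("DELETE FROM department WHERE dno = ?", (40,))

remaining = conn.execute(
    "SELECT dno, dname FROM department ORDER BY dno"
).fetchall()
print(remaining)
```

CREATE TABLE changes what the database can hold, while INSERT, UPDATE, DELETE and SELECT work only with the rows inside that structure.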
Reports - Databases contain vast amounts of information, more than most individuals need, especially in large organizations. Most people don't actually use the database itself. Generally, they look through reports containing just the information they are interested in. Many such reports can be created from the same database. Using the same database of employees, you might create one report that lists just names and addresses and another that shows salaries and benefits.
Exporting to a spreadsheet - Databases are great for storing data but not as powerful as spreadsheets when it comes to analyzing it, so data from the database is frequently exported to a spreadsheet, in most cases just the set of records appropriate to the problem being looked at. For example, in a sales database, you might download January's sales in California to a spreadsheet to see how they compare to last year's or to another state's sales.

Exporting to a word processor - It's common to use a query to isolate the appropriate database information and then export that data, or link it, usually in the form of a table, to a document you're creating in a word processor. It's even more common to use the database in conjunction with a word processing program to mail merge documents. The word processing program is used to create the form letter or other document containing merge codes that refer to fields in the database, from which the data is retrieved when the document is merge-printed.

Data Access Security
This is concerned with the integrity of the data in the database. The DBA assigns various data access privileges to users depending on the sensitivity of the data. The privileges may be to read only, to update, to delete, etc. User authentication and access restrictions are meant to secure the data from malicious users.

System Recovery
The DBA is responsible for recovering the contents of the database in case of any hardware or software failure. This may include keeping backups of the database and of the transactions that cause changes in the database.

NORMALIZATION
Normalization is a method of breaking down complex table structures into simple table structures by using certain rules to form well-defined relations. This reduces redundancy and discrepancies in the data and reduces the problems of inconsistency and wasted disk space. Normalization results in simple tables that satisfy certain specified rules and represent certain normal forms.

Normal forms are categories of relations defined to prevent discrepancies. They are used to protect the database from various anomalies and inconsistencies. A table structure is always in a certain normal form. Several normal forms have been identified; the most popular are:
First Normal Form
Second Normal Form
Third Normal Form
Fourth Normal Form
ENTITY RELATIONSHIP MODEL
The Entity Relationship (ER) model gives a design of the information stored by a business. It illustrates the data and the relationships between the data. Database designers use ER diagrams as a tool to build the logical database design of a system. An ER diagram represents the following three elements: entities, attributes and relationships.

Entity - An entity is any object, item, place, person, concept or activity about which a business needs to store information. An entity can be easily identified and has a distinct set of properties. Examples of entities are employee, department, sales order, product, customer and student. Entities are the building blocks of a database. An entity corresponds to a record.
For example, a student record in a student-administration database is a representation of an actual person. In the diagramming technique, a rectangular box represents an entity and contains the name of the entity.
Figure: The entities STUDENT, COURSE and GRADE, each drawn as a rectangular box
Attribute - An attribute is a property of a given entity. It describes a part of the entity and provides some information about it. An entity can have one or more attributes. For example, for the employee entity, the attributes are the employee name, designation, salary, address, etc. Similarly, for the CUSTOMER entity, the attributes can be the Customer-ID, Name, Address, Phone-No, etc. An attribute usually corresponds to a field in a record. Attributes are depicted as ellipses, labeled with the name of the property.
Figure: The CUSTOMER entity with its attributes Customer-ID, Name, Address and Phone-No drawn as ellipses
Primary Key - A primary key is a group of one or more attributes that uniquely identifies a row in a table. In the following relation, either the attribute DEPT_ID or the attribute NAME can be used as the primary key, as both attributes carry distinct values.

DEPT_ID   NAME         LOCATION
1         Sales        New York
2         Computer     Houston
3         Accounts     Boston
4         Production   New York
If no single attribute in the table has unique values, a combination of any number of attributes can be used to identify the rows. The key with multiple attributes is known as a composite key. In the example shown below, the combination of PRODUCT_ID and SUPPLIER_ID results in all unique values and can be used as a Composite Primary key.
PRODUCT_ID   SUPPLIER_ID   PRICE
P01          S01           1000
P01          S02           3000
P01          S03           2300
P02          S01           4500
P02          S04           1000
P03          S05           680
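A composite primary key like the one above can be declared and tested directly. The sketch below uses Python's standard sqlite3 module. The table name product_supplier, the helper try_insert and the three extra test rows are assumptions made for illustration. NOT NULL is written out explicitly because SQLite does not by default reject NULL in ordinary primary key columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE product_supplier (
        product_id  TEXT NOT NULL,
        supplier_id TEXT NOT NULL,
        price       INTEGER,
        PRIMARY KEY (product_id, supplier_id)
    )
    """
)
rows = [
    ("P01", "S01", 1000), ("P01", "S02", 3000), ("P01", "S03", 2300),
    ("P02", "S01", 4500), ("P02", "S04", 1000), ("P03", "S05", 680),
]
conn.executemany("INSERT INTO product_supplier VALUES (?, ?, ?)", rows)


def try_insert(row):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO product_supplier VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


duplicate_accepted = try_insert(("P01", "S01", 1200))  # key pair already used
new_pair_accepted = try_insert(("P03", "S01", 700))    # new combination
null_key_accepted = try_insert((None, "S02", 500))     # NULL in a key column
print(duplicate_accepted, new_pair_accepted, null_key_accepted)
```

Neither PRODUCT_ID nor SUPPLIER_ID is unique on its own, but a repeated pair is rejected, which is exactly what makes the pair usable as a key.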
Note that a primary key cannot contain a NULL value. The primary key must uniquely identify each row in an entity; if a primary key value were null, it wouldn't be able to identify anything.

Relationship - A relationship is an association between entities. It is used to establish a connection between a pair of logically related entities. For example, consider the relationship between the entities EMPLOYEE and DEPARTMENT. Each employee belongs to one department, and a department contains many employees. Thus, there is a one-to-many relationship between the department and the employee.
EMPLOYEE
# empno
* name
o salary
o job

DEPARTMENT
# deptno
* name
o location

Here, # indicates the primary key, * a mandatory attribute and o an optional attribute.
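The one-to-many relationship between DEPARTMENT and EMPLOYEE can be sketched with a foreign key. The example below uses Python's standard sqlite3 module. The deptno column in the employee table is an assumption (the diagram does not list it), as are the sample rows and the rejected employee Ann. Mandatory attributes are modelled as NOT NULL and optional ones are left nullable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled

conn.execute(
    "CREATE TABLE department ("
    "deptno INTEGER PRIMARY KEY, name TEXT NOT NULL, location TEXT)"
)
conn.execute(
    """
    CREATE TABLE employee (
        empno  INTEGER PRIMARY KEY,
        name   TEXT NOT NULL,
        salary INTEGER,
        job    TEXT,
        deptno INTEGER NOT NULL REFERENCES department (deptno)
    )
    """
)
conn.execute("INSERT INTO department VALUES (10, 'Sales', 'New York')")
conn.execute("INSERT INTO department VALUES (20, 'Computer', 'Houston')")
conn.executemany(
    "INSERT INTO employee (empno, name, job, deptno) VALUES (?, ?, ?, ?)",
    [
        (100, "Jack", "Manager", 20),
        (101, "Kevin", "Operator", 10),
        (102, "Sam", "Salesman", 20),
    ],
)

# One department, many employees: count the employees in each department.
per_department = conn.execute(
    "SELECT d.name, COUNT(e.empno) FROM department d "
    "LEFT JOIN employee e ON e.deptno = d.deptno "
    "GROUP BY d.deptno ORDER BY d.deptno"
).fetchall()

# An employee that points at a non-existent department is rejected.
try:
    conn.execute(
        "INSERT INTO employee (empno, name, deptno) VALUES (103, 'Ann', 99)"
    )
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(per_department, orphan_rejected)
```

The foreign key sits on the "many" side: each employee row names exactly one department, while a department can be named by any number of employee rows.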
A relationship is represented using a diamond labeled with the name of the relationship. For example, if students study various courses, the entities are STUDENT and COURSE and the relationship between them is Studies.
Figure: The entities STUDENT and COURSE joined by a diamond labeled Studies
NORMALISATION
Normalisation is a series of steps that enables us to identify potential problems, called update anomalies, in the design of a relational database. The process also supplies methods for correcting these problems.

Definition: Normalisation is a process of successive reduction of a given set of relations to a better form.
The normalization process involves converting tables into various normal forms. A table in a particular normal form possesses a certain collection of properties. There are several normal forms, the most common being 1NF, 2NF, 3NF and 4NF. They form a progression in which a table that is in 1NF is better than a table that is not in 1NF, a table that is in 2NF is better than a table that is only in 1NF, and so on. The goal of the normalization process is to take a collection of tables and produce a new collection of tables that represents the same information but is free of problems. In this context, the concepts of functional dependence and keys are important to understand.
Decomposition
Decomposition refers to the breaking down of one table into multiple tables. Any database design process involves decomposition. Decomposition is similar to the projection operation of relational algebra: in a projection we keep only the needed columns of a table and discard the rest, i.e. we take all rows but only the specified columns. Here, discarding does not mean that the columns are lost forever; they are simply placed in the table where they logically belong. Recall that redundancy means unwanted or uncontrolled duplication of data. Apart from duplication, redundancy also leads to data inconsistency and loss of data integrity. Let's take a table of students with the following data.

Roll No.   St. Name   Subject code   Subject Name   Marks
101        Suresh     S1             English        80
101        Suresh     S2             OR             56
101        Suresh     S3             BDP            78
102        Harish     S1             English        39
102        Harish     S2             OR             75
102        Harish     S3             BDP            75
We can see some redundancy here, in the student names and the subject names, so it is not a good idea to keep the two together. This is the problem that decomposition addresses. To minimize redundancy we follow a simple rule: keep one fact in one place.
So we decompose the above table into two tables as shown:

Examination table: RollNo, St-name, Sub-code, Sub-name, Marks

Student table: RollNo, St-name
Subject table: Sub-code, Sub-name, Marks

So now we have the student information in the Student table and the subject information in the Subject table.
Assume RollNo and Sub-code are the primary keys of the two tables. This arrangement has taken care of the duplication problem. Now for each student there will be only one entry in the Student table, and for each subject only one entry in the Subject table. There are no repeated entries, hence no redundancy. But in the process we have introduced an undesirable problem: where do we find out which student has obtained how many marks in which subject? We have lost this information, because there is no link between students and subjects.
Loss of information due to decomposition is called Lossy Decomposition

Lossy decomposition therefore is not acceptable. So we need some form of decomposition in which we do not lose information on either the data in our tables or the interrelationship between various data items. So we should be able to successfully reassemble the split table so as to recreate the original data. Then the decomposition is successful.
When all information found in the original database is preserved after decomposition, we call it lossless decomposition or non-lossy decomposition.
Examination table: RollNo, St-name, Sub-code, Sub-name, Marks

Student table: RollNo, St-name
Result table: RollNo, Sub-code, Marks
Subject table: Sub-code, Sub-name
We can recreate the original table from the above three tables. There are referential integrity relationships between the tables, as follows:

Between the Student and Result tables, based on RollNo
Between the Subject and Result tables, based on Sub-code
If we join the three tables to perform recomposition, we get the following columns (after eliminating duplicate columns): RollNo, St-name, Sub-code, Sub-name, Marks. Thus we have been able to preserve the data and the relationships between data elements even after decomposition.

Further, let us examine the functional dependencies (FDs). In the Examination table there were three functional dependencies, namely:

RollNo → St-name
Sub-code → Sub-name
RollNo, Sub-code → Marks
When we decompose this table into the two-table structure, we lose one of them: RollNo, Sub-code → Marks. So to find out whether a decomposition is lossy or lossless, we can examine the FDs; if we lose any of the original FDs, the decomposition is lossy. When we create the three-table structure, we regain the lost FD. So in a lossy decomposition we lose some of the FD relationships, while in a lossless decomposition all FD relationships are preserved.
FDs after lossy decomposition:
RollNo → St-name
Sub-code → Sub-name

FDs after lossless decomposition:
RollNo → St-name
Sub-code → Sub-name
RollNo, Sub-code → Marks
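The difference between the two decompositions can be checked mechanically by splitting the table and trying to join it back. The sketch below is plain Python and uses the student data from the example table. The helper name project and the use of column positions are assumptions made for illustration, and the two-table split is taken as plain projections of the original rows.

```python
# Examination table: (RollNo, St-name, Sub-code, Sub-name, Marks)
examination = {
    (101, "Suresh", "S1", "English", 80),
    (101, "Suresh", "S2", "OR", 56),
    (101, "Suresh", "S3", "BDP", 78),
    (102, "Harish", "S1", "English", 39),
    (102, "Harish", "S2", "OR", 75),
    (102, "Harish", "S3", "BDP", 75),
}


def project(rows, *cols):
    """Projection: keep the given column positions and drop duplicate rows."""
    return {tuple(row[c] for c in cols) for row in rows}


# Two-table split: Student(RollNo, St-name), Subject(Sub-code, Sub-name, Marks)
student = project(examination, 0, 1)
subject_with_marks = project(examination, 2, 3, 4)
# The tables share no column, so the only possible join is a cross product.
rejoined_two = {s + sub for s in student for sub in subject_with_marks}

# Three-table split: Student, Subject(Sub-code, Sub-name),
# Result(RollNo, Sub-code, Marks)
subject = project(examination, 2, 3)
result = project(examination, 0, 2, 4)
names = dict(student)      # RollNo -> St-name
subjects = dict(subject)   # Sub-code -> Sub-name
rejoined_three = {
    (roll, names[roll], code, subjects[code], marks)
    for roll, code, marks in result
}

print(len(examination), len(rejoined_two), rejoined_two == examination)
print(rejoined_three == examination)
```

With two tables the join produces 12 rows instead of 6, so we can no longer tell which marks belong to which student. With three tables the join returns exactly the original rows.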
Note: The difference between decomposition and normalization is that decomposition does not abide by any formal rules, while normalization follows formal rules. When we apply a normalization rule, the database design takes the next logical form, called the normal form.
CASE STUDY
Assume an Order table with the following structure:

Order No, Order Date, Customer No, Item No, Item Name, Quantity, Rate, Bill Amount

This table contains information about the orders received from various customers. The table identifies an order by the Order No column, a customer by Customer No and an item by Item No. So for a given order, several entries will repeat.

To convert the above into First Normal Form (1NF)
A table is in 1NF if it doesn't contain any repeating column or repeating group of columns. The Order table does not conform to 1NF. To bring it into 1NF we use decomposition, and we need to ensure that the decomposition is lossless. The simple principle is: move all the repeating columns to another table. So we now have two tables:
a) The modified ORDER table
b) A new table called ORDER-ITEM that contains the repeating columns

ORDER: Order No, Order Date, Customer No
ORDER-ITEM: Item No, Item Name, Quantity, Rate, Bill Amount
If we test the above tables, we find that this is a lossy decomposition: if we join the two tables, we do not get the original table back. So we have to add another column to make it lossless, i.e. in the ORDER-ITEM table we add an Order No column to link each item sold to its order.
ORDER: Order No, Order Date, Customer No
ORDER-ITEM: Order No, Item No, Item Name, Quantity, Rate, Bill Amount (primary key: Order No + Item No)

So we insert a referential integrity relationship between the two tables, based on Order No. Order No in the ORDER table is the primary key, and Order No in the ORDER-ITEM table is a foreign key. This is now a lossless decomposition, as the original data relationships are preserved. The tables are in 1NF.

Second Normal Form
A table is in 2NF if it is in the first normal form and all non-key columns in the table depend on the entire primary key. So the prerequisites for a database to be in the second normal form are:
i) All the tables should be in 1NF
ii) No non-key attribute is dependent on only a portion of the primary key

Note: If the primary key of a table contains only a single column, the table is automatically in 2NF.

Let's examine this in our case study problem. The first condition is already satisfied. The two tables were:

ORDER: Order No, Order Date, Customer No
ORDER-ITEM: Order No, Item No, Item Name, Quantity, Rate, Bill Amount (primary key: Order No + Item No)
The first table contains three columns. The primary key is Order No. From Order No we can derive the other non-key columns, i.e. Order Date and Customer No. Also, the non-key attributes do not depend on each other, so this table is in 2NF.

The second table has Order No and Item No as a composite key. Can we determine the values of the other non-key columns of the table by using the primary key? Not quite. We can determine Item Name from Item No alone, which is only part of the composite key; we do not need Order No for this purpose. Likewise, the rate (unit price) can be found from Item No alone. Thus not all non-key columns depend on the entire primary key; some depend on only part of it. So this table fails to satisfy the second condition of 2NF. We therefore follow a simple strategy:
Move columns that do not depend on the entire primary key to another table
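Whether a column depends on the whole key or only on part of it can be tested against sample rows. The sketch below is plain Python. The function name determines and the sample ORDER-ITEM rows are assumptions invented for illustration. A check like this can only refute a dependency on the rows it sees; it cannot prove that the dependency holds for all future data.

```python
def determines(rows, lhs, rhs):
    """Return True if, in these rows, equal values of the lhs columns
    always come with equal values of the rhs columns."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        value = tuple(row[c] for c in rhs)
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical ORDER-ITEM rows (sample data invented for illustration).
order_item = [
    {"order_no": 1, "item_no": 100, "item_name": "HD", "quantity": 2, "rate": 500},
    {"order_no": 1, "item_no": 108, "item_name": "Monitor", "quantity": 1, "rate": 300},
    {"order_no": 2, "item_no": 100, "item_name": "HD", "quantity": 5, "rate": 500},
]

# Item name and rate follow from Item No alone: a partial dependency.
item_no_determines_name = determines(order_item, ["item_no"], ["item_name", "rate"])
# Quantity does not follow from Item No alone ...
item_no_determines_quantity = determines(order_item, ["item_no"], ["quantity"])
# ... but it does follow from the full key (Order No, Item No).
full_key_determines_quantity = determines(
    order_item, ["order_no", "item_no"], ["quantity"]
)
print(item_no_determines_name, item_no_determines_quantity,
      full_key_determines_quantity)
```

Columns that pass the first kind of test (Item Name, Rate) are the ones to move out; columns that need the full key (Quantity) stay.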
So we need to decompose the ORDER-ITEM table to bring it into 2NF. But why do we need to do so? The answer is update anomalies. Update anomalies are the problems that we are likely to face if we maintain the existing table structure. They are classified by the three SQL operations INSERT, UPDATE and DELETE.

INSERT PROBLEMS - Assume we have a new item that has not yet been ordered. Can we record information about this item in the ORDER-ITEM table? No. Until an order is placed for that item, we cannot insert a row for it, because every row needs an order number.

UPDATE PROBLEMS - Suppose we have a product with Item No 100 and Item Name HD, and we need to change the name of this item to FDD. We then have to change the description in many rows of the table, because this item could be part of many orders. Every such row must be located and amended.

DELETE PROBLEMS - Suppose we have only one order for an item, say Item No 108. If we cancel this order, the row is deleted from the ORDER-ITEM table and all the data about the item is lost with it, as it is not stored anywhere else.

Therefore, after decomposition, the tables look as follows:
ORDER: Order No, Order Date, Customer No
ORDER-ITEM: Order No, Item No, Quantity, Bill Amount (primary key: Order No + Item No)
ITEM: Item No, Item Name, Rate
So there is a referential integrity relationship between the ORDER-ITEM and ITEM tables, based on Item No. Now let's verify the update anomalies.

Insert - We can now insert a new item in the ITEM table without needing even a single order for this item.
Update - We can update the item name and unit price once, in one place.
Delete - Even if we delete an order, the information about the item is not lost.

THIRD NORMAL FORM (3NF)
A table is in 3NF if it is in 2NF and all non-key columns in the table depend non-transitively on the entire primary key. Simply put, the table should be in 2NF and every non-key column in the table must be independent of all other non-key columns.

What is a transitive dependency?
It is an indirect relationship between two columns.
a) Suppose there is a functional dependency between two columns A and B such that B is functionally dependent on A.
b) Suppose there is also a functional dependency between two columns B and C such that C is functionally dependent on B.
Then we say that C transitively depends on A, and we represent this as A → B → C.

So for 3NF we should identify all such transitive dependencies (T.D.) and remove them. In the ORDER table there is no T.D., because Order Date and Customer No are functionally dependent (F.D.) on Order No. In the ORDER-ITEM table, Quantity is F.D. on the primary key (Order No, Item No). This is not true for Bill Amount, which is calculated as Rate x Quantity. Rate is available in the ITEM table and Quantity is available in the ORDER-ITEM table, so the Bill Amount column does not depend directly on the primary key of the ORDER-ITEM table. In the ITEM table there is no T.D., as Rate (unit price) and Item Name are F.D. on Item No. So we need to remove the Bill Amount column to bring the tables into 3NF.
ORDER: Order No, Order Date, Customer No
ORDER-ITEM: Order No, Item No, Quantity (primary key: Order No + Item No)
ITEM: Item No, Item Name, Rate

Let's revisit the anomalies:
Insert - Since there is no Bill Amount column, we are not required to recalculate anything when we insert an item row.
Update - Even if we make changes in any order row, we need not recalculate the total bill amount.
Delete - If we delete one order row of an order that has multiple rows, we need not recalculate the total bill amount. So the delete anomaly is taken care of.
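The final three-table design can be sketched with Python's standard sqlite3 module, with the bill amount computed by a query instead of being stored. The table is named orders because ORDER is a reserved word in SQL. The column names, sample rows and rates are assumptions invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the referential integrity links
conn.executescript(
    """
    CREATE TABLE orders (
        order_no    INTEGER PRIMARY KEY,
        order_date  TEXT NOT NULL,
        customer_no INTEGER NOT NULL
    );
    CREATE TABLE item (
        item_no   INTEGER PRIMARY KEY,
        item_name TEXT NOT NULL,
        rate      INTEGER NOT NULL
    );
    CREATE TABLE order_item (
        order_no INTEGER NOT NULL REFERENCES orders (order_no),
        item_no  INTEGER NOT NULL REFERENCES item (item_no),
        quantity INTEGER NOT NULL,
        PRIMARY KEY (order_no, item_no)
    );
    INSERT INTO item VALUES (100, 'HD', 500), (108, 'Monitor', 300);
    INSERT INTO orders VALUES (1, '2024-01-05', 7);
    INSERT INTO order_item VALUES (1, 100, 2), (1, 108, 1);
    """
)

# The bill amount is derived when needed instead of being stored.
BILL_QUERY = """
    SELECT oi.order_no, SUM(i.rate * oi.quantity)
    FROM order_item AS oi JOIN item AS i ON i.item_no = oi.item_no
    GROUP BY oi.order_no
"""
bill_before = conn.execute(BILL_QUERY).fetchall()

# Changing a rate is a single-row update; no stored amount can go stale.
conn.execute("UPDATE item SET rate = 550 WHERE item_no = 100")
bill_after = conn.execute(BILL_QUERY).fetchall()

print(bill_before, bill_after)
```

Because the amount is always recomputed from Rate and Quantity, inserting, updating or deleting an order row never leaves an out-of-date total behind.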