Organization of Records in Files
There are several ways of organizing records in files.
• Heap file organization. Any record can be placed anywhere in the file where there is space for it. There is no ordering of records.
• Sequential file organization. Records are stored in sequential order, based on the value of the search key of each record.
• Hashing file organization. A hash function is computed on some attribute of each record. The result of the function specifies the block of the file in which the record should be placed. This is discussed in Chapter 11, since it is closely related to the indexing structure.
• Clustering file organization. Records of several different relations can be stored in the same file. Related records of the different relations are stored on the same block, so that one I/O operation fetches related records from all the relations.

Sequential File Organization
1. A sequential file is designed for efficient processing of records in sorted order on some search key.
o Records are chained together by pointers to permit fast retrieval in search key order.
o Each pointer points to the next record in order.
o Records are stored physically in search key order (or as close to this as possible).
o This minimizes the number of block accesses.
o Figure 10.15 shows an example, with bname as the search key.
2. It is difficult to maintain physical sequential order as records are inserted and deleted.
o Deletion can be managed with the pointer chains.
o Insertion poses problems if there is no space where the new record should go.
o If there is space, use it; otherwise put the new record in an overflow block.
o Adjust pointers accordingly.
o Figure 10.16 shows the previous example after an insertion.
o Problem: we now have some records out of physical sequential order.
o If very few records are in overflow blocks, this will work well.
o If order is lost, reorganize the file.
o Reorganizations are expensive and are done when system load is low.
3. If insertions rarely occur, we could keep the file in physically sorted order and reorganize when an insertion occurs. In this case, the pointer fields are no longer required.
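The pointer chain and the overflow area can be sketched in a few lines of Python. This is only an in-memory illustration under simplifying assumptions: the class name SequentialFile and the branch names are hypothetical, and every insertion goes to the overflow area (the case where no free slot exists at the right position).

```python
class SequentialFile:
    """Toy model: records[i] is [key, index_of_next_record_in_key_order]."""

    def __init__(self, sorted_keys):
        # Main area: records are stored physically in search-key order.
        self.records = [[k, i + 1] for i, k in enumerate(sorted_keys)]
        if self.records:
            self.records[-1][1] = None
        self.head = 0 if self.records else None

    def insert(self, key):
        # Assumption: no free slot in place, so the record goes to the
        # overflow area, modelled here as the physical end of the file.
        new_pos = len(self.records)
        self.records.append([key, None])
        # Adjust pointers so the chain still visits keys in sorted order.
        prev, cur = None, self.head
        while cur is not None and self.records[cur][0] <= key:
            prev, cur = cur, self.records[cur][1]
        self.records[new_pos][1] = cur
        if prev is None:
            self.head = new_pos
        else:
            self.records[prev][1] = new_pos

    def scan(self):
        # Follow the pointer chain: logical (search-key) order.
        out, cur = [], self.head
        while cur is not None:
            out.append(self.records[cur][0])
            cur = self.records[cur][1]
        return out

    def physical_order(self):
        return [key for key, _ in self.records]


f = SequentialFile(["Brighton", "Downtown", "Perryridge"])
f.insert("Mianus")
```

After the insertion, f.scan() still returns the keys in sorted order, while f.physical_order() shows the new record sitting out of sequence at the end. A reorganization would rewrite the file in the order given by scan().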

Clustering File Organization
1. One relation per file, with fixed-length records, is good for small databases, and it also reduces the code size.
2. Many large-scale DB systems do not rely directly on the underlying operating system for file management. One large OS file is allocated to the DB system, and all relations are stored in that one file.
3. To efficiently execute queries involving the join of depositor and customer, one may store the depositor tuples for each cname near the customer tuple for the corresponding cname, as shown in Figure 10.19.
4. This structure mixes together tuples from two relations, but allows for efficient processing of the join.
5. If a customer has so many accounts that the records cannot fit in one block, the remaining records appear on nearby blocks. This file structure, called clustering, allows us to read many of the required records using one block read.
6. Our use of clustering enhances the processing of a particular join but may result in slow processing of other types of queries, such as a selection on customer. For example, the query
    select * from customer
now requires more block accesses, as the customer relation is now interspersed with the depositor relation.
7. Thus it is a trade-off, depending on the types of query that the database designer believes to be most frequent. Careful use of clustering may produce a significant performance gain.
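The trade-off can be made concrete with a back-of-the-envelope block count. The numbers below (block capacity, number of customers, accounts per customer) are hypothetical, and the sketch assumes that a customer's depositor tuples are stored contiguously in either layout.

```python
import math

BLOCK_CAPACITY = 4          # records per block (hypothetical)
N_CUSTOMERS = 8             # hypothetical
ACCOUNTS_PER_CUSTOMER = 3   # depositor tuples per customer (hypothetical)


def blocks_needed(n_records, capacity=BLOCK_CAPACITY):
    """Number of blocks occupied by n_records packed densely."""
    return math.ceil(n_records / capacity)


# Full scan of customer ("select * from customer").
# Separate files: only customer blocks are read.
scan_customer_separate = blocks_needed(N_CUSTOMERS)
# Clustered file: customer tuples are interleaved with depositor tuples,
# so the scan has to read past all of them.
scan_customer_clustered = blocks_needed(N_CUSTOMERS * (1 + ACCOUNTS_PER_CUSTOMER))

# Join for a single customer.
# Separate files: one customer block plus the depositor block(s).
join_one_separate = 1 + blocks_needed(ACCOUNTS_PER_CUSTOMER)
# Clustered file: the customer tuple and its depositor tuples share a block.
join_one_clustered = blocks_needed(1 + ACCOUNTS_PER_CUSTOMER)
```

With these numbers the join for one customer drops from 2 block reads to 1, while the full scan of customer grows from 2 block reads to 8.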

Data Dictionary Storage
1. The database also needs to store information about the relations, known as the data dictionary. This includes:
o Names of relations.
o Names of attributes of relations.
o Domains and lengths of attributes.
o Names and definitions of views.
o Integrity constraints (e.g., key constraints).
plus data on the system users:
o Names of authorized users.
o Accounting information about users.
plus (possibly) statistical and descriptive data:
o Number of tuples in each relation.
o Method of storage used for each relation (e.g., clustered or non-clustered).
2. When we look at indices (Chapter 11), we'll also see a need to store information about each index on each relation:
o Name of the index.
o Name of the relation being indexed.
o Attributes the index is on.
o Type of index.
3. This information is, in itself, a miniature database. We can use the database to store data about itself, simplifying the overall structure of the system, and allowing the full power of the database to be used to permit fast access to system data.
4. The exact choice of how to represent system data using relations must be made by the system designer. One possible representation follows.

    System-catalog-schema = (relation-name, number-attrs)
    Attr-schema = (attr-name, rel-name, domain-type, position, length)
    User-schema = (user-name, encrypted-password, group)
    Index-schema = (index-name, rel-name, index-type, index-attr)
    View-schema = (view-name, definition)
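As a rough illustration of "using the database to store data about itself", two of the catalog schemas above can be created as ordinary tables and queried like any other relation. The sketch uses Python's built-in sqlite3 module. Hyphens are replaced by underscores because they are not valid in SQL identifiers, and the attributes street and ccity of customer are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table system_catalog (relation_name text primary key,
                                 number_attrs integer);
    create table attr (attr_name text, rel_name text, domain_type text,
                       position integer, length integer);
""")

# Describe a (hypothetical) customer relation inside the catalog itself.
conn.execute("insert into system_catalog values ('customer', 3)")
conn.executemany(
    "insert into attr values (?, ?, ?, ?, ?)",
    [("cname", "customer", "char", 1, 20),
     ("street", "customer", "char", 2, 30),
     ("ccity", "customer", "char", 3, 30)],
)


def attributes_of(rel_name):
    """Look up a relation's attribute names, in position order, via plain SQL."""
    rows = conn.execute(
        "select attr_name from attr where rel_name = ? order by position",
        (rel_name,),
    )
    return [row[0] for row in rows]
```

Calling attributes_of("customer") answers a question about the schema with the same query machinery that is used for user data.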


2.4 Database Systems: Mapping Constraints
An E-R scheme may define certain constraints to which the contents of a database must conform.

Mapping Cardinalities: express the number of entities to which another entity can be associated via a relationship. For binary relationship sets between entity sets A and B, the mapping cardinality must be one of:
1. One-to-one: An entity in A is associated with at most one entity in B, and an entity in B is associated with at most one entity in A. (Figure 2.3)
2. One-to-many: An entity in A is associated with any number in B. An entity in B is associated with at most one entity in A. (Figure 2.4)
3. Many-to-one: An entity in A is associated with at most one entity in B. An entity in B is associated with any number in A. (Figure 2.5)
4. Many-to-many: Entities in A and B are associated with any number from each other. (Figure 2.6)
The appropriate mapping cardinality for a particular relationship set depends on the real world being modeled. (Think about the CustAcct relationship...)
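The four cases can be checked mechanically on a concrete relationship instance. The function below is a hypothetical helper, not part of the E-R model itself: it reports the tightest cardinality that a given set of (a, b) pairs satisfies. The declared cardinality of a relationship set remains a design decision, since a small instance can look one-to-one by accident.

```python
from collections import defaultdict


def mapping_cardinality(pairs):
    """Classify a binary relationship instance given as (a, b) pairs."""
    b_per_a = defaultdict(set)
    a_per_b = defaultdict(set)
    for a, b in pairs:
        b_per_a[a].add(b)
        a_per_b[b].add(a)
    # Does some entity in A relate to more than one entity in B?
    a_to_many = any(len(bs) > 1 for bs in b_per_a.values())
    # Does some entity in B relate to more than one entity in A?
    b_to_many = any(len(as_) > 1 for as_ in a_per_b.values())
    if a_to_many and b_to_many:
        return "many-to-many"
    if a_to_many:
        return "one-to-many"
    if b_to_many:
        return "many-to-one"
    return "one-to-one"
```

For example, if customers may share accounts and also hold several accounts each, the customer/account pairs come out as many-to-many.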

Existence Dependencies: if the existence of entity X depends on the existence of entity Y, then X is said to be existence dependent on Y. (Or we say that Y is the dominant entity and X is the subordinate entity.) For example,
- Consider account and transaction entity sets, and a relationship log between them.
- This is one-to-many from account to transaction.
- If an account entity is deleted, its associated transaction entities must also be deleted.
- Thus account is dominant and transaction is subordinate.
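In a relational implementation, one common way to enforce such a dependency is a foreign key with cascading delete. The sketch below uses Python's sqlite3 module. The table name txn, the column names, and the sample rows are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("pragma foreign_keys = on")
conn.executescript("""
    create table account (account_number text primary key, balance integer);
    -- txn is the subordinate entity set: each row depends on one account.
    create table txn (
        transaction_number integer,
        account_number text references account(account_number)
                            on delete cascade,
        amount integer
    );
    insert into account values ('A-101', 500), ('A-215', 700);
    insert into txn values (1, 'A-101', 50), (2, 'A-101', -20), (3, 'A-215', 10);
""")

# Deleting the dominant entity removes its subordinate entities as well.
conn.execute("delete from account where account_number = 'A-101'")
remaining = conn.execute(
    "select transaction_number from txn order by transaction_number"
).fetchall()
```

After the delete, only the transaction that belongs to account A-215 is left in txn.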

Database Systems: Overview of Physical Storage Media
1. Several types of data storage exist in most computer systems. They vary in speed of access, cost per unit of data, and reliability.

Cache: the most costly and fastest form of storage. Usually very small, and managed by the computer system hardware.

Main Memory (MM): the storage area for data available to be operated on.
- General-purpose machine instructions operate on main memory.
- Contents of main memory are usually lost in a power failure or "crash".
- Usually too small (even with megabytes) and too expensive to store the entire database.

Flash memory: EEPROM (electrically erasable programmable read-only memory).
- Data in flash memory survive power failure.
- Reading data from flash memory takes about 10 nanoseconds (roughly as fast as from main memory). Writing data into flash memory is more complicated: a write-once takes about 4-10 microseconds.
- To overwrite what has been written, one has to first erase the entire bank of the memory. Flash memory may support only a limited number of erase cycles (10^4 to 10^6).
- It has become popular as a replacement for disks for storing small volumes of data (5-10 megabytes).

Magnetic-disk storage: primary medium for long-term storage.
- Typically the entire database is stored on disk.
- Data must be moved from disk to main memory in order for the data to be operated on.
- After operations are performed, data must be copied back to disk if any changes were made.
- Disk storage is called direct access storage, as it is possible to read data on the disk in any order (unlike sequential access).
- Disk storage usually survives power failures and system crashes.

Optical storage: CD-ROM (compact-disk read-only memory), WORM (write-once read-many) disk (for archival storage of data), and jukebox (containing a few drives and numerous disks loaded on demand).

Tape Storage: used primarily for backup and archival data.
- Cheaper, but much slower to access, since tape must be read sequentially from the beginning.
- Used as protection from disk failures!
2. The storage device hierarchy is presented in Figure 10.1. The higher levels are more expensive (cost per bit) and faster (access time), but have smaller capacity.

3. Another classification: primary, secondary, and tertiary storage.
(a) Primary storage: the fastest storage media, such as cache and main memory.
(b) Secondary (or on-line) storage: the next level of the hierarchy, e.g., magnetic disks.
(c) Tertiary (or off-line) storage: magnetic tapes and optical disk jukeboxes.
4. Volatility of storage. Volatile storage loses its contents when the power is removed. Without power backup, data in volatile storage (the part of the hierarchy from main memory up) must be written to nonvolatile storage for safekeeping.

Magnetic Disk
Database Systems: Physical Characteristics of Disks
1. The storage capacity of a single disk ranges from 10MB to 10GB. A typical commercial database may require hundreds of disks.
2. Figure 10.2 shows a moving-head disk mechanism.
• Each disk platter has a flat circular shape. Its two surfaces are covered with a magnetic material, and information is recorded on the surfaces. The platters of hard disks are made from rigid metal or glass, while floppy disks are made from flexible material.
• The disk surface is logically divided into tracks, which are subdivided into sectors. A sector (varying from 32 bytes to 4096 bytes, usually 512 bytes) is the smallest unit of information that can be read from or written to disk. There are 4-32 sectors per track and 20-1500 tracks per disk surface.
• The arm can be positioned over any one of the tracks.
• The platter is spun at high speed.
• To read information, the arm is positioned over the correct track.
• When the data to be accessed passes under the head, the read or write operation is performed.
3. A disk typically contains multiple platters (see Figure 10.2). The read-write heads of all the tracks are mounted on a single assembly called a disk arm, and move together.
• Multiple disk arms are moved as a unit by the actuator.
• Each arm has two heads, to read disks above and below it.
• The set of tracks over which the heads are located forms a cylinder.
• This cylinder holds the data that is accessible within the disk latency time.
• It is clearly sensible to store related data in the same or adjacent cylinders.
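The geometry above determines the capacity of a disk: surfaces × tracks per surface × sectors per track × bytes per sector. The small calculation below uses the upper ends of the ranges quoted in these notes. The platter count of 4 is an assumption, as is the simplification that every track holds the same number of sectors.

```python
def disk_capacity_bytes(platters, tracks_per_surface, sectors_per_track,
                        bytes_per_sector=512):
    """Total capacity, assuming both surfaces of every platter are used."""
    surfaces = 2 * platters
    return surfaces * tracks_per_surface * sectors_per_track * bytes_per_sector


def cylinder_capacity_bytes(platters, sectors_per_track, bytes_per_sector=512):
    """A cylinder is one track from every surface, readable without a seek."""
    return 2 * platters * sectors_per_track * bytes_per_sector


capacity = disk_capacity_bytes(platters=4, tracks_per_surface=1500,
                               sectors_per_track=32)
cylinder = cylinder_capacity_bytes(platters=4, sectors_per_track=32)
```

Here capacity is 196,608,000 bytes (roughly 190 MB) and one cylinder holds 131,072 bytes (128 KB), which is the amount of related data that can be placed so that no arm movement is needed to reach it.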

4. Disk platters range from 1.8" to 14" in diameter. 5 1/4" and 3 1/2" disks dominate because they cost less and have faster seek times than larger disks, yet still provide high storage capacity.
5. A disk controller interfaces between the computer system and the actual hardware of the disk drive. It accepts commands to read or write a sector, and initiates the corresponding actions. Disk controllers also attach checksums to each sector to detect read errors.
6. Remapping of bad sectors: if a controller detects that a sector is damaged when the disk is initially formatted, or when an attempt is made to write the sector, it can logically map the sector to a different physical location.
7. SCSI (Small Computer System Interface) is commonly used to connect disks to PCs and workstations. Mainframe and server systems usually have a faster and more expensive bus to connect to the disks.
8. Head crash: the head touches the platter surface and damages the recording medium, which usually causes the entire disk to fail.
9. A fixed-head disk has a separate head for each track, which means very many heads and a very high cost. Multiple disk arms allow more than one track to be accessed at a time. Both were used in high-performance mainframe systems but are relatively rare today.

Lecture Notes: Relational Algebra
There is no chapter on relational algebra in the course. I had first planned to include one, but relational algebra does not really fit in an introductory course like this one. Instead there is a short explanation in the glossary, and for those who want to read more there are also these lecture notes in English.

What? Why?
• Similar to normal algebra (as in 2+3*x-y), except we use relations as values instead of numbers, and the operations and operators are different.
• Not used as a query language in actual DBMSs. (SQL instead.)
• The inner, lower-level operations of a relational DBMS are, or are similar to, relational algebra operations. We need to know about relational algebra to understand query execution and optimization in a relational DBMS.
• Some advanced SQL queries require explicit relational algebra operations, most commonly outer join.
• Relations are seen as sets of tuples, which means that no duplicates are allowed. SQL behaves differently in some cases. Remember the SQL keyword distinct.
• SQL is declarative, which means that you tell the DBMS what you want, but not how it is to be calculated.
• A C++ or Java program is procedural, which means that you have to state, step by step, exactly how the result should be calculated.
• Relational algebra is (more) procedural than SQL. (Actually, relational algebra is mathematical expressions.)

Set operations
Relations in relational algebra are seen as sets of tuples, so we can use basic set operations.

Review of concepts and operations from set theory
• set
• element
• no duplicate elements (but: multiset = bag)
• no order among the elements (but: ordered set)
• subset
• proper subset (with fewer elements)
• superset
• union
• intersection
• set difference
• cartesian product

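Example (projection): The table E (for EMPLOYEE)

nr  name   salary
1   John   100
5   Sarah  300
7   Tom    100

SQL: select salary from E
Relational algebra: PROJECT_{salary}(E)
Result:

salary
100
300

SQL: select nr, salary from E
Relational algebra: PROJECT_{nr, salary}(E)
Result:

nr  salary
1   100
5   300
7   100

Note that there are no duplicate rows in the result. (In SQL itself, select salary from E would return 100 twice. Write select distinct salary to get the duplicate-free result of the projection.)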
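Projection is easy to mimic with Python sets, which also makes the duplicate elimination visible. The function project and the COLUMNS tuple are hypothetical helpers written for this illustration, with the table E represented as a set of plain tuples.

```python
COLUMNS = ("nr", "name", "salary")
E = {(1, "John", 100), (5, "Sarah", 300), (7, "Tom", 100)}


def project(relation, columns, wanted):
    """PROJECT: keep only the wanted columns; the set drops duplicate rows."""
    positions = [columns.index(name) for name in wanted]
    return {tuple(row[i] for i in positions) for row in relation}


salaries = project(E, COLUMNS, ["salary"])
nr_and_salary = project(E, COLUMNS, ["nr", "salary"])
```

salaries contains only two tuples, (100,) and (300,), because John's and Tom's rows collapse into one. nr_and_salary keeps all three rows, since nr makes them distinct.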

Example (selection): The same table E (for EMPLOYEE) as above.

SQL: select * from E where salary < 200
Relational algebra: SELECT_{salary < 200}(E)
Result:

nr  name  salary
1   John  100
7   Tom   100

SQL: select * from E where salary < 200 and nr >= 7
Relational algebra: SELECT_{salary < 200 and nr >= 7}(E)
Result:

nr  name  salary
7   Tom   100

Note that the select operation in relational algebra has nothing to do with the SQL keyword select. Selection in relational algebra returns those tuples in a relation that fulfil a condition, while the SQL keyword select means "here comes an SQL statement".

Relational algebra expressions
SQL: select name, salary from E where salary < 200
Relational algebra: PROJECT_{name, salary}(SELECT_{salary < 200}(E))
or, step by step, using an intermediate result:
    Temp <- SELECT_{salary < 200}(E)
    Result <- PROJECT_{name, salary}(Temp)
Result:

name  salary
John  100
Tom   100
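The step-by-step form maps directly onto function composition. The sketch below is one possible Python rendering, with select and project as hypothetical helper functions operating on a set of named tuples.

```python
from collections import namedtuple

Employee = namedtuple("Employee", ["nr", "name", "salary"])
E = {Employee(1, "John", 100), Employee(5, "Sarah", 300), Employee(7, "Tom", 100)}


def select(relation, predicate):
    """SELECT: keep the tuples that fulfil the condition."""
    return {row for row in relation if predicate(row)}


def project(relation, *attrs):
    """PROJECT: keep only the named attributes; the set removes duplicates."""
    return {tuple(getattr(row, a) for a in attrs) for row in relation}


temp = select(E, lambda r: r.salary < 200)   # Temp <- SELECT_{salary < 200}(E)
result = project(temp, "name", "salary")     # Result <- PROJECT_{name, salary}(Temp)

# The nested form gives the same relation.
nested = project(select(E, lambda r: r.salary < 200), "name", "salary")
```

Both result and nested contain the tuples ("John", 100) and ("Tom", 100).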

The operations have their own symbols. The symbols are hard to write in HTML that works with all browsers, so I'm writing PROJECT etc here. The real symbols:

Operation          My HTML            Symbol
Projection         PROJECT            π
Selection          SELECT             σ
Renaming           RENAME             ρ
Union              UNION              ∪
Intersection       INTERSECTION       ∩
Assignment         <-                 ←
Cartesian product  X                  ×
Join               JOIN               ⋈
Left outer join    LEFT OUTER JOIN    ⟕
Right outer join   RIGHT OUTER JOIN   ⟖
Full outer join    FULL OUTER JOIN    ⟗
Semijoin           SEMIJOIN           ⋉

Example: The relational algebra expression which I would here write as PROJECT_{Namn}(SELECT_{Medlemsnummer < 3}(Medlem)) should actually be written π_{Namn}(σ_{Medlemsnummer < 3}(Medlem)).

Cartesian product
The cartesian product of two tables combines each row in one table with each row in the other table.

Example: The table E (for EMPLOYEE)

enr  ename  dept
1    Bill   A
2    Sarah  C
3    John   A

Example: The table D (for DEPARTMENT)

dnr  dname
A    Marketing
B    Sales
C    Legal

SQL: select * from E, D
Relational algebra: E X D
Result:

enr  ename  dept  dnr  dname
1    Bill   A     A    Marketing
1    Bill   A     B    Sales
1    Bill   A     C    Legal
2    Sarah  C     A    Marketing
2    Sarah  C     B    Sales
2    Sarah  C     C    Legal
3    John   A     A    Marketing
3    John   A     B    Sales
3    John   A     C    Legal

• Seldom useful in practice.
• Usually an error.
• Can give a huge result.

Join (sometimes called "inner join")
The cartesian product example above combined each employee with each department. If we only keep those lines where the dept attribute for the employee is equal to the dnr (the department number) of the department, we get a nice list of the employees, and the department that each employee works for:

SQL: select * from E, D where dept = dnr
Relational algebra: SELECT_{dept = dnr}(E X D)
or, using the equivalent join operation: E JOIN_{dept = dnr} D
Result:

enr  ename  dept  dnr  dname
1    Bill   A     A    Marketing
2    Sarah  C     C    Legal
3    John   A     A    Marketing

• A very common and useful operation.
• Equivalent to a cartesian product followed by a select.
• Inside a relational DBMS, it is usually much more efficient to calculate a join directly, instead of calculating a cartesian product and then throwing away most of the lines.
• Note that the same SQL query can be translated to several different relational algebra expressions, which all give the same result.
• If we assume that these relational algebra expressions are executed inside a relational DBMS which uses relational algebra operations as its lower-level internal operations, different relational algebra expressions can take very different amounts of time (and memory) to execute.
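The difference between "product, then select" and a direct join can be seen in a small Python sketch. The hash join shown here is just one well-known way of computing a join directly. It is used as an illustration and is not claimed to be what any particular DBMS does.

```python
from itertools import product

E = [(1, "Bill", "A"), (2, "Sarah", "C"), (3, "John", "A")]   # (enr, ename, dept)
D = [("A", "Marketing"), ("B", "Sales"), ("C", "Legal")]      # (dnr, dname)


def join_via_product(e_rows, d_rows):
    """SELECT_{dept = dnr}(E X D): build all pairs, then discard most of them."""
    return [e + d for e, d in product(e_rows, d_rows) if e[2] == d[0]]


def hash_join(e_rows, d_rows):
    """E JOIN_{dept = dnr} D: index D on dnr, then probe once per employee."""
    by_dnr = {}
    for d in d_rows:
        by_dnr.setdefault(d[0], []).append(d)
    return [e + d for e in e_rows for d in by_dnr.get(e[2], [])]


pairs_examined = len(list(product(E, D)))   # size of the cartesian product
joined = hash_join(E, D)
```

Both functions return the same three rows, but the first one examines all 9 pairs of the cartesian product, while the second touches each input row only once.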

Natural join
A normal inner join, but using the join condition that columns with the same names should be equal. Duplicate columns are removed.
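A minimal sketch of a natural join over rows stored as Python dictionaries follows. It assumes, for the sake of the example, that the department column is called dnr in both tables, so that dnr is the one shared column name.

```python
E = [{"enr": 1, "ename": "Bill", "dnr": "A"},
     {"enr": 2, "ename": "Sarah", "dnr": "C"},
     {"enr": 3, "ename": "John", "dnr": "A"}]
D = [{"dnr": "A", "dname": "Marketing"},
     {"dnr": "B", "dname": "Sales"},
     {"dnr": "C", "dname": "Legal"}]


def natural_join(r_rows, s_rows):
    """Join on equality of all columns that have the same name in both inputs."""
    result = []
    for r in r_rows:
        for s in s_rows:
            shared = r.keys() & s.keys()
            if all(r[col] == s[col] for col in shared):
                # Merging the dicts keeps each shared column only once.
                result.append({**r, **s})
    return result


rows = natural_join(E, D)
```

Each result row has the four columns enr, ename, dnr and dname, with the shared column dnr appearing once. If the two inputs have no column name in common, the condition is trivially true and the function degenerates to a cartesian product.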

Renaming tables and columns
Example: The table E (for EMPLOYEE)

nr  name   dept
1   Bill   A
2   Sarah  C
3   John   A

Example: The table D (for DEPARTMENT)

nr  name
A   Marketing
B   Sales
C   Legal

We want to join these tables, but:
• Several columns in the result will have the same name (nr and name).
• How do we express the join condition, when there are two columns called nr?

Solutions:
• Rename the attributes, using the rename operator.
• Keep the names, and prefix them with the table name, as is done in SQL. (This is somewhat unorthodox.)

Using the rename operator:
SQL: select * from E as E(enr, ename, dept), D as D(dnr, dname) where dept = dnr
Relational algebra: (RENAME_{(enr, ename, dept)}(E)) JOIN_{dept = dnr} (RENAME_{(dnr, dname)}(D))
Result:

enr  ename  dept  dnr  dname
1    Bill   A     A    Marketing
2    Sarah  C     C    Legal
3    John   A     A    Marketing

Prefixing with the table name:
SQL: select * from E, D where dept = D.nr
Relational algebra: E JOIN_{dept = D.nr} D
Result:

nr  name   dept  nr  name
1   Bill   A     A   Marketing
2   Sarah  C     C   Legal
3   John   A     A   Marketing

You can use another variant of the renaming operator to change the name of a table, for example to change the name of E to R. This is necessary when joining a table with itself (see below).
    RENAME_{R}(E)
A third variant lets you rename both the table and the columns:
    RENAME_{R(enr, ename, dept)}(E)

Aggregate functions
Example: The table E (for EMPLOYEE)

nr  name   salary  dept
1   John   100     A
5   Sarah  300     C
7   Tom    100     A
12  Anne   null    C

SQL: select sum(salary) from E
Relational algebra: F_{sum(salary)}(E)
Result:

sum
500

• Duplicates are not eliminated.
• Null values are ignored.

SQL: select count(salary) from E
Relational algebra: F_{count(salary)}(E)
Result:

count
3

SQL: select count(distinct salary) from E
Relational algebra: F_{count(salary)}(PROJECT_{salary}(E))
Result:

count
2

You can calculate aggregates "grouped by" something:

SQL: select sum(salary) from E group by dept
Relational algebra: _{dept}F_{sum(salary)}(E)
Result:

dept  sum
A     200
C     300

Several aggregates simultaneously:

SQL: select sum(salary), count(*) from E group by dept
Relational algebra: _{dept}F_{sum(salary), count(*)}(E)
Result:

dept  sum  count
A     200  2
C     300  2

Note that count(*) counts rows, so Anne is included in the count for department C even though her salary is null.

Standard aggregate functions: sum, count, avg, min, max
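The treatment of nulls and duplicates can be reproduced in a few lines of Python, with None playing the role of null. The helper group_sum_count is hypothetical. It mirrors a grouped sum(salary) together with count(*).

```python
E = [(1, "John", 100, "A"), (5, "Sarah", 300, "C"),
     (7, "Tom", 100, "A"), (12, "Anne", None, "C")]   # (nr, name, salary, dept)

# Aggregates over a column skip nulls but keep duplicates.
salaries = [salary for (_, _, salary, _) in E if salary is not None]
total = sum(salaries)                 # sum(salary)
count_salary = len(salaries)          # count(salary)
count_distinct = len(set(salaries))   # count(distinct salary)


def group_sum_count(rows):
    """Per dept: (sum of non-null salaries, number of rows), like sum(salary), count(*)."""
    groups = {}
    for _nr, _name, salary, dept in rows:
        subtotal, n_rows = groups.get(dept, (0, 0))
        if salary is not None:
            subtotal += salary
        groups[dept] = (subtotal, n_rows + 1)   # count(*) counts every row
    return groups


per_dept = group_sum_count(E)
```

Anne's null salary does not contribute to the sum for department C, but her row is still counted by count(*), so per_dept["C"] is (300, 2).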

Example: The table E (for EMPLOYEE)

nr  name      mgr
1   Gretchen  null
2   Bob       1
5   Anne      2
6   John      2
3   Hulda     1
4   Hjalmar   1
7   Usama     4

Going up in the hierarchy one level: What's the name of John's boss?

SQL: select b.name from E p, E b where p.mgr = b.nr and p.name = "John"
Relational algebra:
    PROJECT_{bname}([SELECT_{pname = "John"}(RENAME_{P(pnr, pname, pmgr)}(E))] JOIN_{pmgr = bnr} [RENAME_{B(bnr, bname, bmgr)}(E)])
or, in a less wide-spread notation:
    PROJECT_{b.name}([SELECT_{name = "John"}(RENAME_{P}(E))] JOIN_{p.mgr = b.nr} [RENAME_{B}(E)])
or, step by step:
    P <- RENAME_{P(pnr, pname, pmgr)}(E)
    B <- RENAME_{B(bnr, bname, bmgr)}(E)
    J <- SELECT_{pname = "John"}(P)
    C <- J JOIN_{pmgr = bnr} B
    R <- PROJECT_{bname}(C)
Result:

name
Bob

Notes about renaming:
• We are joining E with itself, both in the SQL query and in the relational algebra expression: it's like joining two tables with the same name and the same attribute names. Therefore, some renaming is required.
• RENAME_{P}(E) JOIN... RENAME_{B}(E) is a start, but then we still have the same attribute names.

Going up in the hierarchy two levels: What's the name of John's boss' boss?

SQL: select ob.name from E p, E b, E ob where b.mgr = ob.nr and p.mgr = b.nr and p.name = "John"
Relational algebra:
    PROJECT_{ob.name}(([SELECT_{name = "John"}(RENAME_{P}(E))] JOIN_{p.mgr = b.nr} [RENAME_{B}(E)]) JOIN_{b.mgr = ob.nr} [RENAME_{OB}(E)])
or, step by step:
    P <- RENAME_{P(pnr, pname, pmgr)}(E)
    B <- RENAME_{B(bnr, bname, bmgr)}(E)
    OB <- RENAME_{OB(obnr, obname, obmgr)}(E)
    J <- SELECT_{pname = "John"}(P)
    C1 <- J JOIN_{pmgr = bnr} B
    C2 <- C1 JOIN_{bmgr = obnr} OB
    R <- PROJECT_{obname}(C2)
Result:

name
Gretchen

Recursive closure
Both one and two levels up: What's the name of John's boss, and of John's boss' boss?

SQL: (select b.name ...) union (select ob.name ...)
Relational algebra: (...) UNION (...)
Result:

name
Bob
Gretchen

Recursively: What's the name of all John's bosses? (One, two, three, four or more levels.)
• Not possible in (conventional) relational algebra, but a special operation called transitive closure has been proposed.
• Not possible in (standard) SQL (SQL2), but possible in SQL3, and by using SQL together with a host language with loops or recursion.
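The "host language with loops" approach can be sketched as follows. The function all_bosses is a hypothetical helper that walks the mgr column of the EMPLOYEE table from the hierarchy example, one level per iteration. It assumes that employee names are unique.

```python
E = [(1, "Gretchen", None), (2, "Bob", 1), (5, "Anne", 2), (6, "John", 2),
     (3, "Hulda", 1), (4, "Hjalmar", 1), (7, "Usama", 4)]   # (nr, name, mgr)


def all_bosses(rows, name):
    """Names of all bosses of `name`, nearest boss first."""
    by_nr = {nr: (emp_name, mgr) for nr, emp_name, mgr in rows}
    mgr_of = {emp_name: mgr for _nr, emp_name, mgr in rows}
    bosses, seen = [], set()
    mgr = mgr_of[name]
    # Each iteration corresponds to one more self-join in the fixed-depth queries.
    while mgr is not None and mgr not in seen:   # `seen` guards against cycles
        seen.add(mgr)
        boss_name, next_mgr = by_nr[mgr]
        bosses.append(boss_name)
        mgr = next_mgr
    return bosses


johns_bosses = all_bosses(E, "John")
```

For John this gives Bob and then Gretchen. For Usama it gives Hjalmar and then Gretchen, without the number of levels having to be known in advance.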

Outer join
Example: The table E (for EMPLOYEE)

enr  ename  dept
1    Bill   A
2    Sarah  C
3    John   A

Example: The table D (for DEPARTMENT)

dnr  dname
A    Marketing
B    Sales
C    Legal

List each employee together with the department he or she works at:

SQL: select * from E, D where dept = dnr
or, using an explicit join: select * from (E join D on dept = dnr)
Relational algebra: E JOIN_{dept = dnr} D
Result:

enr  ename  dept  dnr  dname
1    Bill   A     A    Marketing
2    Sarah  C     C    Legal
3    John   A     A    Marketing

No employee works at department B, Sales, so it is not present in the result. This is probably not a problem in this case. But what if we want to know the number of employees at each department?

SQL: select dnr, dname, count(*) from E, D where dept = dnr group by dnr, dname
or, using an explicit join: select dnr, dname, count(*) from (E join D on dept = dnr) group by dnr, dname
Relational algebra: _{dnr, dname}F_{count(*)}(E JOIN_{dept = dnr} D)
Result:

dnr  dname      count
A    Marketing  2
C    Legal      1

No employee works at department B, Sales, so it is not present in the result. It disappeared already in the join, so the aggregate function never sees it. But what if we want it in the result, with the right number of employees (zero)?

Use a right outer join, which keeps all the rows from the right table. If a row can't be connected to any of the rows from the left table according to the join condition, null values are used:

SQL: select * from (E right outer join D on dept = dnr)
Relational algebra: E RIGHT OUTER JOIN_{dept = dnr} D
Result:

enr   ename  dept  dnr  dname
1     Bill   A     A    Marketing
2     Sarah  C     C    Legal
3     John   A     A    Marketing
null  null   null  B    Sales

SQL: select dnr, dname, count(*) from (E right outer join D on dept = dnr) group by dnr, dname
Relational algebra: _{dnr, dname}F_{count(*)}(E RIGHT OUTER JOIN_{dept = dnr} D)
Result:

dnr  dname      count
A    Marketing  2
B    Sales      1
C    Legal      1

SQL: select dnr, dname, count(enr) from (E right outer join D on dept = dnr) group by dnr, dname
Relational algebra: _{dnr, dname}F_{count(enr)}(E RIGHT OUTER JOIN_{dept = dnr} D)
Result:

dnr  dname      count
A    Marketing  2
B    Sales      0
C    Legal      1

count(*) counts the null-padded row for Sales and therefore reports 1. count(enr) ignores the null value and gives the correct answer, 0.

Join types:
• JOIN = "normal" join = inner join
• LEFT OUTER JOIN = left outer join
• RIGHT OUTER JOIN = right outer join
• FULL OUTER JOIN = full outer join
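A right outer join and the two ways of counting can be sketched in Python, with None standing in for null. Both function names are hypothetical, and the rows come from the tables E and D of the outer join example.

```python
E = [(1, "Bill", "A"), (2, "Sarah", "C"), (3, "John", "A")]   # (enr, ename, dept)
D = [("A", "Marketing"), ("B", "Sales"), ("C", "Legal")]      # (dnr, dname)


def right_outer_join(e_rows, d_rows):
    """Keep every department; pad with None when no employee matches."""
    result = []
    for d in d_rows:
        matches = [e for e in e_rows if e[2] == d[0]]
        if matches:
            result.extend(e + d for e in matches)
        else:
            result.append((None, None, None) + d)
    return result


def count_per_department(joined, count_column=None):
    """count(*) when count_column is None, otherwise count(<that column>)."""
    counts = {}
    for row in joined:
        key = (row[3], row[4])   # (dnr, dname)
        counted = 1 if count_column is None or row[count_column] is not None else 0
        counts[key] = counts.get(key, 0) + counted
    return counts


joined = right_outer_join(E, D)
count_star = count_per_department(joined)                 # like count(*)
count_enr = count_per_department(joined, count_column=0)  # like count(enr)
```

count_star reports 1 for Sales because the padded row is counted, while count_enr reports 0 because the None in the enr column is skipped.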

Outer union
Outer union can be used to calculate the union of two relations that are partially union compatible. Not very common.

Example: The table R

A  B
1  2
3  4

Example: The table S

B  C
4  5
6  7

The result of an outer union between R and S:

A     B  C
1     2  null
3     4  5
null  6  7

Division
Division answers "for all" questions such as: Who works on (at least) all the projects that Bob works on?
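A "for all" question of this kind, which relational algebra usually answers with the division operation, comes down to a subset test per person. The sketch below uses a made-up works_on relation of (person, project) pairs. The names and project codes are hypothetical.

```python
works_on = {("Bob", "P1"), ("Bob", "P2"),
            ("Ann", "P1"), ("Ann", "P2"), ("Ann", "P3"),
            ("Tom", "P1")}


def works_on_all_projects_of(relation, person):
    """People whose project set contains every project of `person`."""
    required = {proj for (name, proj) in relation if name == person}
    projects_of = {}
    for name, proj in relation:
        projects_of.setdefault(name, set()).add(proj)
    return {name for name, projs in projects_of.items() if required <= projs}


answer = works_on_all_projects_of(works_on, "Bob")
```

Ann qualifies because her projects include both of Bob's. Tom does not. Bob himself is trivially part of the answer.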

Semijoin
A join where the result only contains the columns from one of the joined tables. Useful in distributed databases, so we don't have to send as much data over the network.
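A semijoin can be sketched as a filter on one table that only looks at the join column of the other. The function and its parameter names are hypothetical, and the rows are those of the tables E and D from the join examples.

```python
E = [(1, "Bill", "A"), (2, "Sarah", "C"), (3, "John", "A")]   # (enr, ename, dept)
D = [("A", "Marketing"), ("B", "Sales"), ("C", "Legal")]      # (dnr, dname)


def semijoin(left_rows, right_rows, left_col, right_col):
    """Rows of the left table that have at least one join partner on the right."""
    # Only this set of join values would have to be shipped between sites.
    right_values = {row[right_col] for row in right_rows}
    return [row for row in left_rows if row[left_col] in right_values]


departments_with_staff = semijoin(D, E, left_col=0, right_col=2)
```

The result has only the columns of D and contains Marketing and Legal. In a distributed setting, just the set of dept values needs to travel over the network, not the whole of E.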

Update
To update a named relation, just give the variable a new value. To add all the rows in relation N to the relation R:
    R <- R UNION N

Domain relational calculus

In computer science, domain relational calculus (DRC) is a calculus that was introduced by Michel Lacroix and Alain Pirotte as a declarative database query language for the relational data model.

In DRC, queries have the form:

    {<X1, X2, ..., Xn> | p(<X1, X2, ..., Xn>)}

where each Xi is either a domain variable or a constant, and p(<X1, X2, ..., Xn>) denotes a DRC formula. The result of the query is the set of tuples <X1, ..., Xn> that make the DRC formula true.

This language uses the same operators as tuple calculus: the logical connectives ∧ (and), ∨ (or) and ¬ (not). The existential quantifier (∃) and the universal quantifier (∀) can be used to bind the variables. Its computational expressivity is equivalent to that of relational algebra.

Examples

Let (A, B, C) mean (Rank, Name, ID) in the Enterprise relation and let (D, E, F) mean (Name, DeptName, ID) in the Departments relation.

Find all captains of the starship USS Enterprise:

    {<A, B, C> | <A, B, C> ∈ Enterprise ∧ A = 'Captain'}

In this example, A, B, C denotes both the result set and a set in the table Enterprise.

Find names of Enterprise crew members who are in Stellar Cartography:

    {<B> | ∃ A, C (<A, B, C> ∈ Enterprise ∧ ∃ D, E, F (<D, E, F> ∈ Departments ∧ F = C ∧ E = 'Stellar Cartography'))}

In this example, we're only looking for the name, and that's B. F = C is a requirement, because we need to find Enterprise crew members who are also in the Stellar Cartography department.

An alternate representation of the previous example would be:

    {<B> | ∃ A, C (<A, B, C> ∈ Enterprise ∧ ∃ D (<D, 'Stellar Cartography', C> ∈ Departments))}

In this example, the requested department name is placed directly in the formula in the E position, and the C domain variable is re-used in the query for the existence of a department, since it already holds a crew member's id.
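A DRC query reads much like a set comprehension, which makes Python a convenient way to try one out. The rows below are invented purely for illustration. Only the attribute layout (Rank, Name, ID) and (Name, DeptName, ID) and the department name Stellar Cartography are taken from the text.

```python
# Hypothetical sample data; column order follows (A, B, C) and (D, E, F).
Enterprise = {("Captain", "Crew One", 1),
              ("Lieutenant", "Crew Two", 2),
              ("Ensign", "Crew Three", 3)}
Departments = {("Crew Two", "Stellar Cartography", 2),
               ("Crew Three", "Engineering", 3)}

# Names B for which some Enterprise tuple and some Departments tuple
# satisfy F = C and E = 'Stellar Cartography'.
names = {b
         for (a, b, c) in Enterprise
         for (d, e, f) in Departments
         if f == c and e == "Stellar Cartography"}

# Alternate form: the constant is placed directly in the E position and C is reused.
names_alt = {b
             for (a, b, c) in Enterprise
             if any(e == "Stellar Cartography" and f == c
                    for (d, e, f) in Departments)}
```

Both comprehensions produce the same set of names, matching the claim that the two DRC formulas are alternate representations of the same query.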
