Lec 12 - Physical Design

DATA WAREHOUSING
Physical Design
Lecture 11
Reference
Chapter 12 of Recommended Book
THE READER GROUP OF COLLEGES

OUTLINE
• Optimizer
• Splitting a Database into Tablespaces
• Allocating Data Files
• Disk Block Size

Physical Design
The physical design of a data warehouse is the process of defining how
the data will be stored and organized in the database.
This includes optimization, index selection, and allocation strategies.
Physical design is important because it can have a significant impact on the
performance and scalability of the data warehouse.

Optimizer
An optimizer is responsible for evaluating and comparing query
execution plans.
A query execution plan is a sequence of operations that the DBMS
performs to answer a query.

Optimizer
Consider the following simplified example query:
SELECT customer_name, order_total
FROM customers JOIN orders ON customers.customer_id = orders.customer_id
WHERE order_total > 100
ORDER BY customer_name

Optimizer
A query execution plan for this query might look like this:
Scan the customers and orders tables.
Join the customers and orders tables on the customer_id column.
Filter to include only the rows where order_total is greater than 100.
Sort the filtered table by the customer_name column.
Project the sorted table to include only the customer_name and order_total
columns.
The optimizer chooses the query execution plan that it believes is the most
efficient for executing the query.
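
The plan above can be sketched as a small pipeline over in-memory rows. This is only an illustration of the order of operations, not how a DBMS implements them; the sample rows are hypothetical, while the table and column names follow the example query.

```python
# Hypothetical sample rows; table and column names follow the example query.
customers = [
    {"customer_id": 1, "customer_name": "Zara"},
    {"customer_id": 2, "customer_name": "Ali"},
    {"customer_id": 3, "customer_name": "Bilal"},
]
orders = [
    {"customer_id": 1, "order_total": 250},
    {"customer_id": 2, "order_total": 80},
    {"customer_id": 2, "order_total": 120},
    {"customer_id": 3, "order_total": 100},
]


def run_plan(customers, orders):
    # Scan both tables and join them on customer_id.
    joined = [
        {**c, **o}
        for c in customers
        for o in orders
        if c["customer_id"] == o["customer_id"]
    ]
    # Filter: keep only rows with order_total greater than 100.
    filtered = [row for row in joined if row["order_total"] > 100]
    # Sort the filtered rows by customer_name.
    ordered = sorted(filtered, key=lambda row: row["customer_name"])
    # Project: keep only customer_name and order_total.
    return [(row["customer_name"], row["order_total"]) for row in ordered]


result = run_plan(customers, orders)
print(result)
```

A real optimizer would also consider other orderings of the same operations, for example filtering the orders table before the join so that fewer rows are joined.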
Optimizer
• Rule-Based Optimizer
• Cost Based Optimizer
• Histograms

Rule Based Optimizer
A rule-based optimizer uses a set of rules to generate a query execution plan. The rules are based
on the data structure, query structure, and indexes available. Rule-based
optimizers do not use any statistical information about the data.
Example: SELECT customer_name, order_total FROM customers JOIN
orders ON customers.customer_id = orders.customer_id WHERE order_total > 100
The optimizer first checks whether there is an index on the order_total
column. If there is, the optimizer uses the index to filter the joined
table. Otherwise, the optimizer has to scan the entire joined table to
filter out the rows where order_total is less than or equal to 100.
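
The index rule in this example can be sketched in a few lines. The function name and the way indexes are represented (a set of column names) are assumptions made for illustration; no real DBMS exposes its rules this way.

```python
def choose_access_path(filter_column, indexed_columns):
    """Apply one fixed rule: use an index on the filter column if it exists.

    No statistics about the data are consulted, which is what makes
    this a rule-based decision.
    """
    if filter_column in indexed_columns:
        return "index scan on " + filter_column
    return "full table scan"


# Hypothetical index sets for the example query's WHERE order_total > 100.
with_index = choose_access_path("order_total", {"customer_id", "order_total"})
without_index = choose_access_path("order_total", {"customer_id"})
print(with_index)
print(without_index)
```

The decision depends only on whether the index exists, so it stays the same no matter how many rows actually satisfy the filter.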

Strengths:
• Simple to implement and understand.
• Plans are not affected by changes to the data, such as updates and inserts.
Weaknesses:
• Not scalable.
• Low performance on larger data sets.

Overall, rule-based optimizers are a good choice for
applications with small data sets and relatively simple
queries. However, for applications where performance
is critical or where the queries are complex, a cost-based
optimizer is a better choice.

Cost Based Optimizer
Cost-based optimizers are a type of query optimizer that uses statistical
information about the data to estimate the cost of different query execution
plans. They then choose the plan that is estimated to have the lowest cost.
The statistical information that cost-based optimizers use is typically stored in
the system catalog and includes the cardinality of tables and attributes.

Example: SELECT customer_name, order_total FROM customers JOIN
orders ON customers.customer_id = orders.customer_id WHERE order_total > 100
The optimizer first collects statistical information, including the cardinality of each table,
to estimate the cost of different execution plans. For example, the optimizer might estimate
the cost of the following plans:
=> Nested Loop Join: scan both tables and compare the customer_id in each
pair of rows to find matching rows.
=> Hash Join: create a hash table on customer_id for one table and then scan
the other table to find matching rows.
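
The comparison of the two join plans can be sketched with a toy cost model. The formulas below are simplifying assumptions (row comparisons for the nested loop join, one pass over each table for the hash join); real optimizers also model I/O, CPU, and memory. The function names and cardinalities are hypothetical.

```python
def nested_loop_cost(outer_rows, inner_rows):
    # Toy assumption: every outer row is compared with every inner row.
    return outer_rows * inner_rows


def hash_join_cost(build_rows, probe_rows):
    # Toy assumption: one pass to build the hash table, one pass to probe it.
    return build_rows + probe_rows


def choose_join(customers_rows, orders_rows):
    """Estimate the cost of each plan from table cardinalities and pick the cheapest."""
    costs = {
        "nested loop join": nested_loop_cost(customers_rows, orders_rows),
        "hash join": hash_join_cost(customers_rows, orders_rows),
    }
    best = min(costs, key=costs.get)
    return best, costs


# Hypothetical cardinalities, as they might be read from the catalog.
best, costs = choose_join(customers_rows=1_000, orders_rows=50_000)
print(best, costs)
```

With these cardinalities the hash join is estimated to be far cheaper. With different statistics the choice can change, which is the point of consulting them.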
Strengths:
• Generates efficient query execution plans, even for complex queries
involving larger data sets.
• Can take advantage of new features in the DBMS, e.g. new data types.
Weaknesses:
• Complex to implement and understand.

Histogram
A histogram provides an approximate representation of the distribution of
attribute values. To obtain this representation, we split the range of
possible attribute values into several intervals (called buckets) and evaluate,
for each of them, how often values in that interval appear in the relation.
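
A minimal sketch of this construction, assuming equal-width buckets and values that lie within a known range. Equal-width bucketing is only one option (DBMSs also use other bucketing schemes), and the sample order_total values are hypothetical.

```python
def build_histogram(values, low, high, num_buckets):
    """Return a list of (bucket_low, bucket_high, count) tuples.

    Assumes every value lies in the range [low, high].
    """
    width = (high - low) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        # Clamp the index so that v == high falls into the last bucket.
        index = min(int((v - low) / width), num_buckets - 1)
        counts[index] += 1
    return [
        (low + i * width, low + (i + 1) * width, counts[i])
        for i in range(num_buckets)
    ]


# Hypothetical order_total values.
order_totals = [10, 20, 45, 60, 75, 110, 130, 150, 180, 200]
histogram = build_histogram(order_totals, low=0, high=200, num_buckets=4)
for bucket in histogram:
    print(bucket)
```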

Histogram
Histograms can be used by the data warehouse optimizer to generate more
efficient query execution plans.
A histogram is used to estimate the number of rows that a query will return.
Better estimates help the optimizer choose plans that scan fewer rows,
which reduces load.
This also helps the database scale to larger data sets.
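
The row estimate can be sketched as follows for a predicate such as order_total > 100. The bucket boundaries and counts are hypothetical, and the estimate assumes that values are spread uniformly inside each bucket, which is a common simplification rather than a property of any particular DBMS.

```python
def estimate_rows_greater_than(histogram, threshold):
    """Estimate how many rows have a value above threshold.

    histogram is a list of (bucket_low, bucket_high, count) tuples.
    """
    estimate = 0.0
    for bucket_low, bucket_high, count in histogram:
        if threshold <= bucket_low:
            # The whole bucket lies above the threshold.
            estimate += count
        elif threshold < bucket_high:
            # The threshold cuts the bucket: assume a uniform spread inside it.
            fraction = (bucket_high - threshold) / (bucket_high - bucket_low)
            estimate += count * fraction
    return estimate


# Hypothetical histogram on order_total with 1,000 rows in total.
histogram = [(0, 50, 300), (50, 100, 200), (100, 150, 200), (150, 200, 300)]
print(estimate_rows_greater_than(histogram, 100))
print(estimate_rows_greater_than(histogram, 125))
```

The optimizer gets these estimates without scanning the table, and can feed them into the cost comparison between candidate plans.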

Splitting a Database into Tablespaces
Basic elements of physical design:
• Tablespaces: logical subdivisions of the disk space used by a database. A
tablespace must store uniform data sets, e.g. a sales tablespace, a product
tablespace, and a customer tablespace.
• Data Files: also called segments; a data file stores part of the information in a
tablespace. A data file belongs to a single tablespace, while a number of data files
may be associated with one tablespace, e.g. sales1.dbf, sales2.dbf.
• Disk Blocks: the unit of information read or written by the DBMS. The block size is set at the
time the database is created and cannot be changed later. A typical disk
block size for a data warehouse is 8 KB or 16 KB.

Splitting a database into Tablespaces
Although all the data of a database could theoretically be stored in a single tablespace,
it is recommended to split the information sensibly into smaller parts.
This improves performance and fault tolerance.
You can use the CREATE TABLESPACE statement to create a tablespace.

Allocating Data Files
Allocating data files is the process of distributing your data across multiple disks. This can improve
performance and increase fault tolerance.
Improved performance: when data files are allocated to multiple disks, the database can
read and write data on those disks in parallel, which can significantly improve
query performance.
Fault tolerance: if one of your hard disks fails, you lose only the data
stored on that disk; because the data files are allocated to multiple disks, the
rest of your data is safe.
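
The fault-tolerance argument can be illustrated with a small sketch that assigns data files to disks in round-robin order and then lists what a single disk failure would affect. Round-robin assignment, the disk names, and all file names other than sales1.dbf and sales2.dbf are assumptions made for illustration.

```python
def allocate(data_files, disks):
    """Assign each data file to a disk in round-robin order."""
    return {name: disks[i % len(disks)] for i, name in enumerate(data_files)}


def files_lost(allocation, failed_disk):
    """Return the data files stored on the failed disk."""
    return [name for name, disk in allocation.items() if disk == failed_disk]


# Hypothetical data files and disks.
data_files = ["sales1.dbf", "sales2.dbf", "product1.dbf", "customer1.dbf"]
allocation = allocate(data_files, ["disk0", "disk1"])
lost = files_lost(allocation, "disk1")
print(allocation)
print(lost)
```

Only the files on the failed disk are affected; the files on the other disk remain available.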

Allocating Data Files (Data Striping)
Data striping is a technique for distributing data across multiple disks. It works by splitting the
data into smaller blocks, or stripes, and storing each stripe on a different disk.
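
A minimal sketch of striping, assuming stripes are placed on disks in round-robin order and each disk is modelled as a Python list. The function names, stripe size, and sample data are hypothetical; real striping is done by the storage layer, not by application code.

```python
def stripe(data, num_disks, stripe_size):
    """Split data into stripes and place them on disks in round-robin order."""
    disks = [[] for _ in range(num_disks)]
    for offset in range(0, len(data), stripe_size):
        stripe_number = offset // stripe_size
        disks[stripe_number % num_disks].append(data[offset:offset + stripe_size])
    return disks


def unstripe(disks):
    """Reassemble the original data by reading the stripes back in order."""
    stripes = []
    longest = max(len(disk) for disk in disks)
    for i in range(longest):
        for disk in disks:
            if i < len(disk):
                stripes.append(disk[i])
    return b"".join(stripes)


disks = stripe(b"ABCDEFGHIJ", num_disks=3, stripe_size=2)
print(disks)
print(unstripe(disks))
```

Because consecutive stripes sit on different disks, a large sequential read can be served by several disks at once.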

Allocating Data Files (Data Striping)
Move the tables into the tablespaces using the ALTER TABLE statement.

Disk Block Size
The disk block size specifies the number of bytes read or written
atomically at each disk access.
It normally ranges from 0.5 KB to 256 KB.
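
Because a block is read or written as a whole, the block size determines how many disk accesses a request needs and how many bytes are actually transferred. The arithmetic can be sketched as follows; the function name and the request sizes are hypothetical.

```python
def blocks_to_read(num_bytes, block_size):
    """Number of whole blocks needed to cover num_bytes (rounded up)."""
    return (num_bytes + block_size - 1) // block_size


KB = 1024

# Fetching a single 100-byte row still transfers one full block.
row_blocks = blocks_to_read(100, 8 * KB)
row_bytes_transferred = row_blocks * 8 * KB

# Scanning 1 MB of data needs half as many accesses with 16 KB blocks.
scan_blocks_8k = blocks_to_read(1024 * KB, 8 * KB)
scan_blocks_16k = blocks_to_read(1024 * KB, 16 * KB)

print(row_blocks, row_bytes_transferred)
print(scan_blocks_8k, scan_blocks_16k)
```

Larger blocks mean fewer accesses for large scans, which are typical of data warehouse queries, at the price of transferring more unneeded bytes when only a single row is wanted.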
