DATA WAREHOUSING

Physical Design
Lecture 11
Reference
Chapter 12 of Recommended Book

THE READER GROUP OF COLLEGES


OUTLINE
• Optimizer
• Splitting a Database into Tablespaces
• Allocating Data Files
• Disk Block Size

Physical Design
The physical design of a data warehouse is the process of defining how
the data will be stored and organized in the database.
This includes query optimization, index selection, and allocation strategies.
Physical design is important because it can have a significant impact on the
performance and scalability of the data warehouse.

Optimizer
Physical design of a data warehouse is important because it can have a
significant impact on the performance of the data warehouse.
An optimizer is responsible for evaluating and comparing query
execution plans.
A query execution plan is a sequence of operations that the DBMS
performs to answer a query.

Optimizer
The following query serves as a simplified example for a query execution plan:
SELECT customer_name, order_total
FROM customers JOIN orders ON customers.customer_id = orders.customer_id
WHERE order_total > 100
ORDER BY customer_name

Optimizer
A query execution plan for this query might look like this:
• Scan the customers and the orders tables.
• Join the customers and orders tables on the customer_id column.
• Filter to only include the rows where the order_total is greater than 100.
• Sort the filtered table by the customer_name column.
• Project the sorted table to only include the customer_name and order_total
columns.
The optimizer will choose the query execution plan that it believes to be the
most efficient for executing the query.
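
The plan steps above can be sketched in Python over small in-memory tables. This is only an illustration: the sample rows and the `run_plan` function are hypothetical, and a real DBMS uses its own operators and may reorder the steps.

```python
# Hypothetical sample data; not taken from the lecture.
customers = [
    {"customer_id": 1, "customer_name": "Sara"},
    {"customer_id": 2, "customer_name": "Ali"},
]
orders = [
    {"customer_id": 1, "order_total": 250},
    {"customer_id": 2, "order_total": 80},
    {"customer_id": 2, "order_total": 120},
]


def run_plan(customers, orders, min_total=100):
    # Scan both tables and join them on customer_id (simple nested loop).
    joined = [
        {**c, **o}
        for c in customers
        for o in orders
        if c["customer_id"] == o["customer_id"]
    ]
    # Filter: keep rows where order_total is greater than min_total.
    filtered = [row for row in joined if row["order_total"] > min_total]
    # Sort the filtered rows by customer_name.
    ordered = sorted(filtered, key=lambda row: row["customer_name"])
    # Project: keep only customer_name and order_total.
    return [(row["customer_name"], row["order_total"]) for row in ordered]


result = run_plan(customers, orders)
print(result)
```

Each statement in `run_plan` corresponds to one operation of the plan, so swapping the order of two statements (for example, filtering before joining) gives a different plan for the same query.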
Optimizer
• Rule-Based Optimizer
• Cost-Based Optimizer
• Histograms

Rule Based Optimizer
A rule-based optimizer uses a set of fixed rules to generate a query execution
plan. The rules are based on the data structure, the query structure, and the
indexes available. Rule-based optimizers do not use any statistical
information about the data.
Example: SELECT customer_name, order_total FROM customers JOIN
orders ON customers.customer_id = orders.customer_id WHERE order_total > 100
The optimizer would first check whether there is an index on the order_total
column. If there is, the optimizer would use the index to filter the joined
table. Otherwise, the optimizer would have to scan the entire joined table to
filter out the rows where the order_total column is less than or equal to 100.
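
The index check described above can be sketched as a single fixed rule. The function name `choose_access_path` and the returned labels are hypothetical; the point is only that the decision looks at the available indexes and never at the data.

```python
def choose_access_path(filter_column, indexed_columns):
    """Pick an access path using a fixed rule, with no statistics."""
    # Rule: if an index exists on the filter column, always use it.
    if filter_column in indexed_columns:
        return f"INDEX SCAN on {filter_column}"
    # Otherwise fall back to scanning every row.
    return "FULL TABLE SCAN"


with_index = choose_access_path("order_total", {"customer_id", "order_total"})
without_index = choose_access_path("order_total", {"customer_id"})
print(with_index)
print(without_index)
```

The rule fires regardless of how many rows satisfy `order_total > 100`, which is one reason a rule-based choice can be poor on large or skewed data.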

Rule Based Optimizer
Strengths:
• Simple to implement and understand.
• Not affected by changes to the data, such as updates and inserts, because
no statistics are used.
Weaknesses:
• Not scalable.
• Low performance for larger data sets.

Rule Based Optimizer
Overall, rule-based optimizers are a good choice for
applications with smaller data sets and relatively simple
queries. However, for applications where performance
is critical or where the queries are complex, a cost-based
optimizer is a better choice.

Cost Based Optimizer
Cost-based optimizers are a type of query optimizer that uses statistical
information about the data to estimate the cost of different query execution
plans. They then choose the plan that is estimated to have the lowest cost.
The statistical information that cost-based optimizers use is typically stored in
the system catalog and includes the cardinality of tables and attributes.

Cost Based Optimizer
Example: SELECT customer_name, order_total FROM customers JOIN
orders ON customers.customer_id = orders.customer_id WHERE order_total > 100

The optimizer first collects statistical information, including the cardinality
of each table, to estimate the cost of different execution plans. For example,
the optimizer might estimate the cost of the following plans:
=> Nested Loop Join: for each row of one table, scan the other table and
compare customer_id to find matching rows.
=> Hash Join: create a hash table on customer_id for one table and then scan
the other table to find matching rows.
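
A minimal sketch of how cardinalities can drive the choice between these two plans. The cost formulas are a common textbook simplification that counts row operations rather than disk I/O, and the cardinalities are hypothetical; a real optimizer uses a far more detailed cost model.

```python
def nested_loop_cost(outer_rows, inner_rows):
    # Every outer row is compared with every inner row.
    return outer_rows * inner_rows


def hash_join_cost(build_rows, probe_rows):
    # One pass to build the hash table, one pass to probe it.
    return build_rows + probe_rows


def choose_join(card):
    costs = {
        "nested loop join": nested_loop_cost(card["customers"], card["orders"]),
        "hash join": hash_join_cost(card["customers"], card["orders"]),
    }
    # Pick the plan with the lowest estimated cost.
    best = min(costs, key=costs.get)
    return best, costs


# Hypothetical table cardinalities, as they might be read from the catalog.
cardinality = {"customers": 1_000, "orders": 50_000}
best_plan, plan_costs = choose_join(cardinality)
print(best_plan, plan_costs)
```

With these numbers the nested loop join is estimated at 50,000,000 comparisons and the hash join at 51,000 row visits, so the hash join is chosen.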
Cost Based Optimizer
Strengths:
• Generates efficient query execution plans, even for complex queries
involving larger data sets.
• Can take advantage of new features in the DBMS, e.g. new data types.
Weaknesses:
• Complex to implement and understand.

Histogram
A histogram provides an approximate representation of the distribution of
attribute values. To obtain this representation, the range of possible
attribute values is split into several intervals (called buckets), and for each
bucket we evaluate how often values from that interval appear in the relation.
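
Building such a histogram can be sketched as follows. This assumes equal-width buckets and values that lie within the given range; the sample `order_totals` are hypothetical, and a DBMS may use other bucket types.

```python
def build_histogram(values, low, high, num_buckets):
    """Count how many values fall into each equal-width bucket of [low, high]."""
    width = (high - low) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        # Clamp so that a value equal to `high` lands in the last bucket.
        index = min(int((v - low) / width), num_buckets - 1)
        counts[index] += 1
    return counts


# Hypothetical attribute values.
order_totals = [20, 35, 60, 90, 110, 150, 180, 40, 75, 199]
histogram = build_histogram(order_totals, low=0, high=200, num_buckets=4)
print(histogram)
```

Here the buckets cover 0-50, 50-100, 100-150 and 150-200, and the result holds one count per bucket.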

Histogram
Histograms can be used by the data warehouse optimizer to generate more
efficient query execution plans.
A histogram lets the optimizer estimate the number of rows that a query will
return.
Better estimates also reduce load, because the optimizer can choose plans that
scan fewer rows.
This in turn improves the scalability of the database.
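
As a rough sketch of the row estimate, the function below uses bucket counts to estimate how many rows satisfy a predicate such as order_total > 100. The bucket boundaries and counts are hypothetical, and values are assumed to be spread uniformly inside each bucket.

```python
def estimate_rows_greater_than(buckets, threshold):
    """Estimate how many rows have a value greater than `threshold`.

    `buckets` is a list of (low, high, row_count) tuples.
    """
    estimate = 0.0
    for low, high, count in buckets:
        if threshold <= low:
            # The whole bucket qualifies.
            estimate += count
        elif threshold < high:
            # Only the part of the bucket above the threshold qualifies.
            fraction = (high - threshold) / (high - low)
            estimate += count * fraction
    return estimate


# Hypothetical histogram on order_total.
buckets = [(0, 50, 3000), (50, 100, 3000), (100, 150, 1000), (150, 200, 3000)]
estimated = estimate_rows_greater_than(buckets, threshold=100)
total = sum(count for _, _, count in buckets)
selectivity = estimated / total
print(estimated, selectivity)
```

An optimizer could compare the resulting selectivity with a threshold of its own to decide, for instance, whether an index is worth using.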

Splitting a Database into Tablespaces
Basic elements of physical design:
Tablespaces: logical subdivisions of the disk space used by a database. A
tablespace must store uniform data sets, e.g. a sales tablespace, a product
tablespace, a customer tablespace.
Data Files: also called segments; files storing part of the information in a
tablespace. A data file belongs to a single tablespace, while a number of data
files may be associated with one tablespace, e.g. sales1.dbf, sales2.dbf.
Disk Blocks: the unit of information read or written by the DBMS. The block
size is set at the time the database is created and can’t be changed later.
E.g. a typical disk block size for a data warehouse is 8 KB or 16 KB.
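
The relationship between these elements can be modelled in a few lines. This is an illustrative sketch, not a DBMS API: the `Database` class, its fields, and the file name `customer1.dbf` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Database:
    block_size_kb: int  # fixed when the database is created
    # tablespace name -> list of data file names
    tablespaces: dict = field(default_factory=dict)
    # data file name -> owning tablespace name
    file_owner: dict = field(default_factory=dict)

    def add_data_file(self, tablespace, file_name):
        # A data file may belong to one tablespace only.
        if file_name in self.file_owner:
            owner = self.file_owner[file_name]
            raise ValueError(f"{file_name} already belongs to {owner}")
        self.file_owner[file_name] = tablespace
        # One tablespace may own any number of data files.
        self.tablespaces.setdefault(tablespace, []).append(file_name)


db = Database(block_size_kb=8)
db.add_data_file("sales", "sales1.dbf")
db.add_data_file("sales", "sales2.dbf")
db.add_data_file("customer", "customer1.dbf")
print(db.tablespaces)
```

Trying to attach `sales1.dbf` to a second tablespace raises a `ValueError`, mirroring the one-tablespace-per-file constraint.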

Splitting a Database into Tablespaces
Although all the data of a database can theoretically be stored in one single
tablespace, it is recommended to split the information sensibly into smaller
parts.
This improves performance and fault tolerance.
You can use the CREATE TABLESPACE statement to create a tablespace.

Allocating Data Files
Allocating data files is the process of distributing your data across multiple
disks. This can improve performance and increase fault tolerance.

Improved performance: when data files are allocated to multiple disks, the
database can read and write data on those disks in parallel, which can
significantly improve query performance.
Fault tolerance: if one of your hard disks fails, you lose only the data that
is stored on that disk; because the data files are allocated to multiple disks,
the rest of your data remains safe.

Allocating Data Files (Data Striping)
Data striping is a technique for distributing data across multiple disks. It
works by splitting the data into smaller blocks, called stripes, and storing
each stripe on a different disk.
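
A minimal sketch of striping, assuming stripes are placed on the disks in round-robin order (the placement policy is an assumption, as are the function names and the sample payload). Each disk is modelled as a plain list of stripes.

```python
def stripe(data, stripe_size, num_disks):
    """Split `data` into stripes and assign them to disks round-robin."""
    disks = [[] for _ in range(num_disks)]
    for n, start in enumerate(range(0, len(data), stripe_size)):
        disks[n % num_disks].append(data[start:start + stripe_size])
    return disks


def reassemble(disks):
    """Read the stripes back in round-robin order to rebuild the data."""
    out = []
    longest = max(len(d) for d in disks)
    for i in range(longest):
        for d in disks:
            if i < len(d):
                out.append(d[i])
    return b"".join(out)


payload = b"ABCDEFGHIJKLMNOPQRST"
disks = stripe(payload, stripe_size=4, num_disks=3)
print(disks)
print(reassemble(disks) == payload)
```

Because consecutive stripes sit on different disks, a large sequential read can be served by several disks at once.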

Allocating Data Files (Data Striping)
Move the tables into the tablespaces using the ALTER TABLE statement.

Disk Block Size
The disk block size specifies the number of bytes read or written
atomically at each disk access.
It normally ranges from 0.5 KB to 256 KB.
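
To see why the block size matters, the sketch below counts how many blocks a table occupies for two block sizes. The table dimensions are hypothetical, rows are assumed not to span blocks, and per-block header overhead is ignored.

```python
import math


def blocks_needed(num_rows, row_bytes, block_bytes):
    """Disk blocks needed for a table, assuming rows do not span blocks."""
    rows_per_block = block_bytes // row_bytes
    if rows_per_block == 0:
        raise ValueError("row does not fit into one block")
    return math.ceil(num_rows / rows_per_block)


KB = 1024
rows, row_size = 1_000_000, 200  # hypothetical table
for block_kb in (8, 16):
    print(block_kb, "KB:", blocks_needed(rows, row_size, block_kb * KB))
```

A full scan of this table touches 25,000 blocks at 8 KB and 12,346 blocks at 16 KB, so larger blocks mean fewer disk accesses for scans, while each access transfers more data.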
