Physical Design
Lecture 11
Reference: Chapter 12 of the Recommended Book
THE READER GROUP OF COLLEGES
OUTLINE • Optimizer • Splitting a Database into Tablespaces • Allocating Data Files • Disk Block Size
Physical Design
The physical design of a data warehouse is the process of defining how the data will be stored and organized in the database. This includes optimization, index selection, and allocation strategies. Physical design is important because it can have a significant impact on the performance and scalability of the data warehouse.
Optimizer
Physical design matters because it can have a significant impact on the performance of the data warehouse. An optimizer is responsible for evaluating and comparing query execution plans. A query execution plan is a sequence of operations that the DBMS performs to answer a query.
Optimizer
Consider the following simplified example query, for which the DBMS must build a query execution plan:
SELECT customer_name, order_total
FROM customers JOIN orders ON customers.customer_id = orders.customer_id
WHERE order_total > 100
ORDER BY customer_name
Optimizer
A query execution plan for this query might look like this:
• Scan the customers and orders tables.
• Join the customers and orders tables on the customer_id column.
• Filter the joined rows to keep only those where order_total is greater than 100.
• Sort the filtered rows by the customer_name column.
• Project the sorted rows to include only the customer_name and order_total columns.
The optimizer chooses the query execution plan that it estimates to be the most efficient for executing the query.

Optimizer
• Rule-Based Optimizer
• Cost-Based Optimizer
• Histograms
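The five plan steps listed above (scan, join, filter, sort, project) can be sketched in a few lines of Python. This is only an illustration of the order of operations, not how a real DBMS executes a plan; the table contents and the function name run_plan are made up for the example.

```python
# Toy rows; every value here is invented for illustration.
customers = [
    {"customer_id": 1, "customer_name": "Zara"},
    {"customer_id": 2, "customer_name": "Ali"},
    {"customer_id": 3, "customer_name": "Bilal"},
]
orders = [
    {"customer_id": 1, "order_total": 250},
    {"customer_id": 2, "order_total": 80},
    {"customer_id": 2, "order_total": 120},
    {"customer_id": 3, "order_total": 100},
]


def run_plan(customers, orders):
    # Scan both tables and join them on customer_id.
    joined = [
        {**c, **o}
        for c in customers
        for o in orders
        if c["customer_id"] == o["customer_id"]
    ]
    # Filter: keep rows where order_total is greater than 100.
    filtered = [row for row in joined if row["order_total"] > 100]
    # Sort by customer_name.
    ordered = sorted(filtered, key=lambda row: row["customer_name"])
    # Project: keep only customer_name and order_total.
    return [(row["customer_name"], row["order_total"]) for row in ordered]


result = run_plan(customers, orders)
print(result)
```

A real optimizer may reorder these steps, for example by filtering before joining, as long as the result is the same.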
Rule-Based Optimizer
A rule-based optimizer uses a set of rules to generate a query execution plan. The rules are based on the data structure, the query structure, and the available indexes. Rule-based optimizers do not use any statistical information about the data.
Example:
SELECT customer_name, order_total
FROM customers JOIN orders ON customers.cust_id = orders.cust_id
WHERE order_total > 100
The optimizer first checks whether there is an index on the order_total column. If there is, the optimizer uses the index to filter the joined table. Otherwise, the optimizer has to scan the entire joined table to filter out the rows where order_total is less than or equal to 100.
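The index rule described above can be written as a single fixed decision. The sketch below is a simplification: the function name choose_access_path and the plan labels are hypothetical, and a real rule-based optimizer applies a ranked list of many such rules.

```python
def choose_access_path(filter_column, indexed_columns):
    """Apply one fixed rule: prefer an index on the filter column if it exists."""
    # No statistics about the data are consulted, only the schema.
    if filter_column in indexed_columns:
        return f"INDEX SCAN on {filter_column}"
    return "FULL TABLE SCAN"


with_index = choose_access_path("order_total", {"order_total", "cust_id"})
without_index = choose_access_path("order_total", {"cust_id"})
print(with_index)
print(without_index)
```

Because the decision depends only on which indexes exist, the same plan is chosen whether the table holds ten rows or ten million.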
Rule-Based Optimizer
Strengths:
• Simple to implement and understand.
• Not affected by changes to the data, such as updates and inserts, because plans do not depend on data statistics.
Weaknesses:
• Not scalable.
• Low performance for larger data sets.
Rule-Based Optimizer
Overall, rule-based optimizers are a good choice for applications with smaller data sets and relatively simple queries. However, for applications where performance is critical or where the queries are complex, a cost-based optimizer is a better choice.
Cost-Based Optimizer
Cost-based optimizers use statistical information about the data to estimate the cost of different query execution plans. They then choose the plan with the lowest estimated cost. This statistical information is typically stored in the system catalog and includes the cardinality of tables and attributes.
Cost-Based Optimizer
Example:
SELECT customer_name, order_total
FROM customers JOIN orders ON customers.customer_id = orders.customer_id
WHERE order_total > 100
The optimizer first collects statistical information, including the cardinality of each table, to estimate the cost of different execution plans. For example, the optimizer might estimate the cost of the following plans:
=> Nested Loop Join: for each row of one table, scan the other table and compare the customer_id values to find matching rows.
=> Hash Join: build a hash table on customer_id for one table, then scan the other table and probe the hash table to find matching rows.

Cost-Based Optimizer
Strengths:
• Generates efficient query execution plans, including for complex queries involving larger data sets.
• Can take advantage of new features in the DBMS, e.g. new data types.
Weaknesses:
• Complex to implement and understand.
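The comparison between a nested loop join and a hash join described above can be sketched with deliberately simplified cost formulas. The formulas (rows compared for the nested loop, rows read for the hash join), the function names, and the cardinalities are assumptions made for illustration; real optimizers also account for I/O, memory, and index availability.

```python
def nested_loop_cost(outer_rows, inner_rows):
    # Assumed cost: every outer row is compared with every inner row.
    return outer_rows * inner_rows


def hash_join_cost(build_rows, probe_rows):
    # Assumed cost: one pass to build the hash table, one pass to probe it.
    return build_rows + probe_rows


def choose_join(cardinalities):
    """Pick the join method with the lowest estimated cost."""
    c = cardinalities["customers"]
    o = cardinalities["orders"]
    costs = {
        "nested loop join": nested_loop_cost(c, o),
        "hash join": hash_join_cost(c, o),
    }
    best = min(costs, key=costs.get)
    return best, costs


# Hypothetical catalog statistics.
best_plan, plan_costs = choose_join({"customers": 1_000, "orders": 50_000})
print(best_plan, plan_costs)
```

Under these assumed formulas the choice depends on the statistics: with one row in each table, the nested loop join is estimated to be cheaper.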
Histogram
A histogram provides an approximate representation of the distribution of attribute values. To obtain this representation, the range of possible attribute values is split into several intervals (called buckets), and for each bucket we count how often values from that interval appear in the relation.
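As a minimal sketch of the bucket construction described above, the following code builds a histogram with buckets of equal width. Equal-width buckets are an assumption chosen for simplicity, the sample values are invented, and the code assumes every value lies between low and high.

```python
def build_histogram(values, low, high, num_buckets):
    """Count how many values fall into each equal-width bucket of [low, high]."""
    width = (high - low) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        # Clamp the index so that v == high lands in the last bucket.
        idx = min(int((v - low) / width), num_buckets - 1)
        counts[idx] += 1
    return counts


# Hypothetical order_total values.
order_totals = [20, 45, 80, 95, 120, 150, 155, 190, 30, 199]
histogram = build_histogram(order_totals, low=0, high=200, num_buckets=4)
print(histogram)
```

Each entry is the number of rows in one bucket: [0, 50), [50, 100), [100, 150), and [150, 200].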
Histogram
The data warehouse optimizer can use histograms to generate more efficient query execution plans.
• A histogram lets the optimizer estimate the number of rows that a query will return.
• Better estimates help the optimizer choose plans that scan fewer rows, which reduces load.
• Histograms also improve the scalability of the database.
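To illustrate how a histogram yields a row estimate, the sketch below estimates how many rows satisfy a predicate such as order_total > 100. The bucket counts are hypothetical, the function name is made up, and the code assumes values are spread uniformly inside each bucket, which is a common simplification rather than a property of any particular DBMS.

```python
def estimate_rows_greater_than(counts, low, high, threshold):
    """Estimate the number of rows with value > threshold from bucket counts."""
    width = (high - low) / len(counts)
    estimate = 0.0
    for i, count in enumerate(counts):
        bucket_low = low + i * width
        bucket_high = bucket_low + width
        if threshold <= bucket_low:
            # The whole bucket lies above the threshold.
            estimate += count
        elif threshold < bucket_high:
            # Partial bucket: assume values are uniform within it.
            estimate += count * (bucket_high - threshold) / width
    return estimate


# Hypothetical counts for buckets [0,50), [50,100), [100,150), [150,200].
counts = [3, 2, 1, 4]
estimated = estimate_rows_greater_than(counts, low=0, high=200, threshold=100)
print(estimated)
```

The estimate is approximate by design: it is computed from four counters instead of scanning the table.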
Splitting a Database into Tablespaces
Basic elements of physical design:
• Tablespaces: logical subdivisions of the disk space used by a database. A tablespace should store a uniform data set, e.g. a sales tablespace, a product tablespace, a customer tablespace.
• Data Files: also called segments. A data file stores part of the information in a tablespace. A data file belongs to a single tablespace, but several data files may be associated with one tablespace, e.g. sales1.dbf, sales2.dbf.
• Disk Blocks: the unit of information read or written by the DBMS. The block size is set when the database is created and cannot be changed later. E.g. a typical disk block size for a data warehouse is 8 KB or 16 KB.
Splitting a Database into Tablespaces
Although all the data of a database could theoretically be stored in one single tablespace, it is recommended to split the information sensibly into smaller parts. This improves performance and fault tolerance. You can use the CREATE TABLESPACE statement to create a tablespace.
Allocating Data Files
Allocating data files is the process of distributing your data across multiple disks. This can improve performance and increase fault tolerance.
• Improved performance: when data files are allocated to multiple disks, the database can read and write data on those disks in parallel, which can significantly improve query performance.
• Fault tolerance: if one of your hard disks fails, you lose only the data stored on that disk. The data files allocated to the other disks remain safe.
Allocating Data Files (Data Striping)
Data striping is a technique for distributing data across multiple disks. It works by splitting the data into smaller blocks, called stripes, and storing each stripe on a different disk.
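A minimal sketch of striping follows. It assumes stripes are placed on disks in round-robin order, which is one common policy but is not specified in these slides; the function name stripe and the sample data are made up, and each disk is modelled as a simple list.

```python
def stripe(data, stripe_size, num_disks):
    """Split data into fixed-size stripes and place them round-robin on disks."""
    disks = [[] for _ in range(num_disks)]
    for n, start in enumerate(range(0, len(data), stripe_size)):
        # Stripe n goes to disk n modulo num_disks (assumed policy).
        disks[n % num_disks].append(data[start:start + stripe_size])
    return disks


layout = stripe(b"ABCDEFGHIJ", stripe_size=2, num_disks=3)
for disk_number, stripes in enumerate(layout):
    print(disk_number, stripes)
```

Because consecutive stripes sit on different disks, a sequential read can fetch several stripes at the same time.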
Allocating Data Files (Data Striping)
Move the tables into the tablespaces using the ALTER TABLE statement.
Disk Block Size
The disk block size specifies the number of bytes read or written atomically at each disk access. It normally ranges from 0.5 KB to 256 KB.
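To make the effect of the block size concrete, the sketch below counts how many disk blocks a full table scan would have to read. The row size and row count are hypothetical, and the calculation assumes that rows never span blocks and that blocks have no header overhead, so it is a rough illustration only.

```python
import math


def blocks_needed(num_rows, row_bytes, block_bytes):
    """Number of blocks required to store num_rows fixed-size rows."""
    # Assumption: a row is never split across two blocks.
    rows_per_block = block_bytes // row_bytes
    if rows_per_block == 0:
        raise ValueError("a row does not fit in one block")
    return math.ceil(num_rows / rows_per_block)


# Hypothetical table: one million rows of 100 bytes each.
blocks_8kb = blocks_needed(1_000_000, 100, 8 * 1024)
blocks_16kb = blocks_needed(1_000_000, 100, 16 * 1024)
print(blocks_8kb, blocks_16kb)
```

Under these assumptions, doubling the block size roughly halves the number of disk accesses for a full scan, while each access transfers more bytes.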