1. Starter
2. OLAP Example
3. Comparison of OLAP and OLTP
4. Storing Active OLAP data
5. Multi-dimension OLAP (MOLAP)
6. Relational OLAP (ROLAP)
7. Star schema for OLAP
8. Aggregation
9. Implementing techniques in OLAP
10. Performance improving techniques
SUSHIL KULKARNI

1. Starter
A major issue in information processing is how to process larger and larger databases, containing increasingly complex data, without sacrificing response time. The client/server architecture gives organizations the opportunity to deploy specialized servers that are optimized for handling specific data management problems. Until recently, organizations have tried to target relational database management systems (RDBMSs) for the complete spectrum of database applications. It is, however, apparent that there are major categories of database applications which are not suitably serviced by relational database systems. Oracle, for example, has built a totally new Media Server for handling multimedia applications. Sybase uses an object-oriented DBMS (OODBMS) in its Gain Momentum product, which is designed to handle complex data such as images and audio.

Another category of applications is that of on-line analytical processing (OLAP). OLAP was a term coined by E F Codd (1993) and was defined by him as:

the dynamic synthesis, analysis and consolidation of large volumes of multidimensional data

Codd has developed rules or requirements for an OLAP system:
o multidimensional conceptual view
o transparency
o accessibility
o consistent reporting performance
o client/server architecture
o generic dimensionality
o dynamic sparse matrix handling
o multi-user support
o unrestricted cross dimensional operations
o intuitive data manipulation
o flexible reporting
o unlimited dimensions and aggregation levels
An alternative definition of OLAP has been supplied by Nigel Pendse, who, unlike Codd, does not mix technology prescriptions with application requirements. Pendse defines OLAP as

Fast Analysis of Shared Multidimensional Information

which means:
o Fast, in that users should get a response in seconds and so do not lose their chain of thought;
o Analysis, in that the system can provide analysis functions in an intuitive manner and that the functions should supply business logic and statistical analysis relevant to the user's application;
o Shared, from the point of view of supporting multiple users concurrently;
o Multidimensional, as a main requirement, so that the system supplies a multidimensional conceptual view of the data, including support for multiple hierarchies;
o Information is the data and the derived information required by the user application.

One question is what multidimensional data is and when it becomes OLAP. It is essentially a way to build associations between dissimilar pieces of information using predefined business rules about the information you are using. The following three components of OLAP are identified:
o A multidimensional database must be able to express complex business calculations very easily. The data must be referenced and the mathematics defined. In a relational system there is no relation between line items, which makes it very difficult to express business mathematics.
o Intuitive navigation in order to 'roam around' the data, which requires mining hierarchies.
o Instant response, i.e. the need to give the user the information as quickly as possible.

Dimensional databases are not without problems: they are not suited to storing all types of data, such as lists (for example, customer addresses and purchase orders). Relational systems are also superior in security, backup and replication services, as these tend not to be available at the same level in dimensional systems. The advantage of a dimensional system is the freedom it offers: the user is free to explore the data and receive the type of report they want without being restricted to a set format.

We can also define OLAP alternatively as follows: ‘OLAP applications and tools are those that are designed to ask ad hoc, complex queries of large multidimensional collections of data. It is for this reason that OLAP is often mentioned in the context of Data Warehouses’.

2. OLAP Example

An example OLAP database may comprise sales data which has been aggregated by region, product type, and sales channel.
A typical OLAP query might access a multi-gigabyte, multi-year sales database in order to find all product sales in each region for each product type. After reviewing the results, an analyst might further refine the query to find sales volume for each sales channel within region/product classifications. As a last step, the analyst might want to perform year-to-year or quarter-to-quarter comparisons for each sales channel. This whole process must be carried out on-line with rapid response time so that the analysis process is undisturbed. OLAP queries can be characterized as on-line transactions which:
o Access very large amounts of data, e.g. several years of sales data.
o Analyse the relationships between many types of business elements, e.g. sales, products, regions, channels.
o Involve aggregated data, e.g. sales volumes, budgeted dollars and dollars spent.
o Compare aggregated data over hierarchical time periods, e.g. monthly, quarterly, yearly.
o Present data in different perspectives, e.g. sales by region vs. sales by channels by product within each region.
o Involve complex calculations between data elements, e.g. expected profit calculated as a function of sales revenue for each type of sales channel in a particular region.
o Are able to respond quickly to user requests so that users can pursue an analytical thought process without being stymied by the system.

3. Comparison of OLAP and OLTP

OLAP applications are quite different from On-line Transaction Processing (OLTP) applications, which consist of a large number of relatively simple transactions. The transactions usually retrieve and update a small number of records that are contained in several distinct tables. The relationships between the tables are generally simple. A typical customer order entry OLTP transaction might retrieve all of the data relating to a specific customer and then insert a new order for the customer. Information is selected from the customer, customer order, and detail line tables. Each row in each table contains a customer identification number which is used to relate the rows from the different tables. The relationships between the records are simple and only a few records are actually retrieved or updated by a single transaction.

The difference between OLAP and OLTP has been summarized as follows: OLTP servers handle mission-critical production data accessed through simple queries, while OLAP servers handle management-critical data accessed through an iterative analytical investigation. OLAP and OLTP each have specialized requirements, so each type of processing calls for its own specially optimized server.

4. Storing Active OLAP data

The 'store' in this context means holding the data in a persistent form (for at least the duration of a session, and often shared between users), not simply for the time required to process a single query.

Relational database

This is an obvious choice, particularly if the data is sourced from an RDBMS. In most cases, the data would be stored in a denormalised structure such as a star schema, or one of its variants, such as snowflake; a normalised database would not be appropriate for performance and other reasons. Often, summary data will be held in aggregate tables.
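As a rough illustration of holding summary data in an aggregate table, the sketch below uses Python's built-in sqlite3 module. The table names sales_detail and sales_by_region_month, their columns, and all rows are hypothetical and chosen only for this example; a real warehouse would refresh the aggregate table as part of its load process.

```python
import sqlite3


def build_aggregate_table(conn):
    """Create a detail table and a pre-computed summary (aggregate) table."""
    cur = conn.cursor()
    # Hypothetical detail-level table: one row per individual sale.
    cur.execute("CREATE TABLE sales_detail (region TEXT, month TEXT, amount REAL)")
    cur.executemany(
        "INSERT INTO sales_detail VALUES (?, ?, ?)",
        [
            ("East", "2024-01", 100.0),
            ("East", "2024-01", 50.0),
            ("East", "2024-02", 70.0),
            ("West", "2024-01", 30.0),
        ],
    )
    # Aggregate table: totals are computed once and stored, so later
    # queries read a few summary rows instead of scanning the detail rows.
    cur.execute(
        """
        CREATE TABLE sales_by_region_month AS
        SELECT region, month, SUM(amount) AS total_amount, COUNT(*) AS n_sales
        FROM sales_detail
        GROUP BY region, month
        """
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
build_aggregate_table(conn)
summary = conn.execute(
    "SELECT region, month, total_amount, n_sales "
    "FROM sales_by_region_month ORDER BY region, month"
).fetchall()
for row in summary:
    print(row)
```

Queries that only need monthly totals per region can then read sales_by_region_month directly, at the cost of keeping it in step with the detail data.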
Multidimensional database

In this case, the active data is stored in a multidimensional database on a server. It may include data extracted and summarised from legacy systems or relational databases and from end-users. It is usually possible (and sometimes compulsory) for data to be pre-computed and the results stored in some form of array structure.

Client-based files

In this case, relatively small extracts of data are held on client machines. They may be distributed in advance, or created on demand (possibly via the Web).

5. Multi-dimension OLAP (MOLAP)

MOLAP tools use specialised data structures and multi-dimensional DBMSs to organize, navigate and analyze data. To enhance query performance, the data is typically aggregated and stored according to predicted usage. The development issues associated with MOLAP are:
o The underlying data structures are limited in their ability to support multiple subject areas and to provide access to detailed data.
o Navigation and analysis of data is limited because the data is designed according to previously determined requirements. Data may need to be physically reorganised to optimally support new requirements.
o MOLAP products require a different set of skills and tools to build and maintain the database, thus increasing the cost and complexity of support.

6. Relational OLAP (ROLAP)

ROLAP supports RDBMS products through the use of a metadata layer, thus avoiding the requirement to create a static multi-dimensional data structure. This facilitates the creation of multiple multidimensional views of the two-dimensional relation. To improve performance, some ROLAP products have enhanced SQL engines to support the complexity of multi-dimensional analysis, while others recommend, or require, the use of highly denormalised database designs such as the star schema.
The development issues associated with ROLAP technology are:
o Development of middleware to facilitate the development of multi-dimensional applications; that is, software that converts the two-dimensional relation into a multi-dimensional structure.
o Development of an option to create persistent, multidimensional structures with facilities to assist in the administration of these structures.

7. Star schema for OLAP

Star schema is a data modeling technique used to map multidimensional decision support data into relational databases. The basic star schema has four components: facts, dimensions, attributes and hierarchies. This schema contains a single object (the fact table) in the middle, connected to a number of dimension tables. We will look at these components in detail and then visualize the star schema.

In a multidimensional data model, there is a set of numeric measures called facts that are the objects of analysis. Examples of such measures are sales, budget, revenue, inventory and ROI (return on investment). Each of the numeric measures depends on a set of dimensions, which provide the context for the measure. For example, the dimensions associated with a sale amount can be the city, product name, and the date when the sale was made. The dimensions together are assumed to uniquely determine the measure. Thus, the multidimensional data model views a measure as a value in the multidimensional space of dimensions.

Each dimension is described by a set of attributes. For example, the Product dimension may consist of four attributes: the category and the industry of the product, the year of its introduction, and the average profit margin. For example, the soda Surge belongs to the category beverage and the food industry, was introduced in 1996, and may have an average profit margin of 80%. The attributes of a dimension may be related via a hierarchy of relationships. In the above example, the product name is related to its category and the industry attribute through such a hierarchical relationship.

Facts and dimensions are stored in tables, called fact tables and dimension tables. Let us create a fact table for a supermarket application that is based on the table

Sales (Market_Id, Product_Id, Time_Id, Sales_Amt)

A popular conceptual model that influences database design and the query engines for OLAP is the multidimensional view of data. This table can be viewed as multidimensional, as the first three columns are the dimensions representing specific supermarkets, products and time intervals. The fourth column, Sales_Amt, is a function of the other three.
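The idea that Sales_Amt is a function of the three dimension columns can be sketched in a few lines of Python. This is only an illustration: the row values, the helper names slice_cube and total_by_axis, and the choice of a dictionary as the storage structure are assumptions made for the example, not part of the original notes.

```python
from collections import defaultdict

# Hypothetical rows of Sales(Market_Id, Product_Id, Time_Id, Sales_Amt).
sales_rows = [
    ("M1", "P1", "T1", 10.0),
    ("M1", "P2", "T1", 20.0),
    ("M2", "P1", "T1", 5.0),
    ("M1", "P1", "T2", 7.0),
]

# The cube: each (market, product, time) coordinate maps to one measure value.
cube = {}
for market, product, time_id, amount in sales_rows:
    key = (market, product, time_id)
    if key in cube:
        # The dimensions are assumed to determine the measure uniquely.
        raise ValueError(f"duplicate coordinates in fact table: {key}")
    cube[key] = amount


def slice_cube(cube, market=None, product=None, time_id=None):
    """Return the cells matching the fixed coordinates (None means 'any')."""
    wanted = (market, product, time_id)
    return {
        key: value
        for key, value in cube.items()
        if all(w is None or w == k for w, k in zip(wanted, key))
    }


def total_by_axis(cube, axis):
    """Sum the measure over every dimension except the one at position `axis`."""
    totals = defaultdict(float)
    for key, value in cube.items():
        totals[key[axis]] += value
    return dict(totals)


print(cube[("M1", "P1", "T1")])        # a single cell of the cube
print(slice_cube(cube, market="M1"))   # every cell for market M1
print(total_by_axis(cube, axis=1))     # totals per product
```

Looking up one key corresponds to reading one cell of the cube, while fixing only some coordinates selects a slice of it.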
The fact table can be viewed as a three-dimensional cube, and the entries of the cube are taken from the fact table. For example, the entries in the cube are the Sales_Amt values, with the three dimensions Market_Id, Product_Id and Time_Id. The dimensions of the fact table can be further described with dimension tables. For example, the fact table

Sales (Market_Id, Product_Id, Time_Id, Sales_Amt)

can have dimension tables such as

Market (Market_Id, City, State, Region)
Product (Product_Id, Name, Category, Price)
Time (Time_Id, Week, Month, Quarter)
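The fact and dimension tables above could be declared and queried roughly as follows. This is a minimal sketch using Python's built-in sqlite3 module; the column types, the key constraints and every data row are assumptions added for illustration, since the notes give only the table and column names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Market  (Market_Id INTEGER PRIMARY KEY, City TEXT, State TEXT, Region TEXT);
    CREATE TABLE Product (Product_Id INTEGER PRIMARY KEY, Name TEXT, Category TEXT, Price REAL);
    -- The table name is quoted defensively so it always reads as an identifier.
    CREATE TABLE "Time"  (Time_Id INTEGER PRIMARY KEY, Week INTEGER, Month INTEGER, Quarter INTEGER);

    -- Fact table: one foreign key per dimension plus the measure.
    CREATE TABLE Sales (
        Market_Id  INTEGER REFERENCES Market(Market_Id),
        Product_Id INTEGER REFERENCES Product(Product_Id),
        Time_Id    INTEGER REFERENCES "Time"(Time_Id),
        Sales_Amt  REAL,
        PRIMARY KEY (Market_Id, Product_Id, Time_Id)
    );

    -- Made-up rows, for illustration only.
    INSERT INTO Market VALUES
        (1, 'Mumbai', 'Maharashtra', 'West'),
        (2, 'Pune', 'Maharashtra', 'West'),
        (3, 'Chennai', 'Tamil Nadu', 'South');
    INSERT INTO Product VALUES (1, 'Cola', 'Beverage', 1.5), (2, 'Bread', 'Bakery', 2.0);
    INSERT INTO "Time" VALUES (1, 1, 1, 1), (2, 14, 4, 2);
    INSERT INTO Sales VALUES
        (1, 1, 1, 100.0), (2, 1, 1, 40.0), (3, 2, 1, 25.0), (1, 2, 2, 60.0);
    """
)

# A typical star join: the fact table joined to two of its dimension
# tables, with the measure aggregated by attributes of those dimensions.
query = """
    SELECT m.Region, p.Category, SUM(s.Sales_Amt) AS Total
    FROM Sales AS s
    JOIN Market  AS m ON m.Market_Id  = s.Market_Id
    JOIN Product AS p ON p.Product_Id = s.Product_Id
    GROUP BY m.Region, p.Category
    ORDER BY m.Region, p.Category
"""
region_category_totals = conn.execute(query).fetchall()
for row in region_category_totals:
    print(row)
```

Every query of this kind goes through the central Sales table and fans out to whichever dimension tables supply the grouping attributes, which is the pattern the star shape reflects.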
The fact and dimension relations can be displayed in an E-R diagram whose shape suggests a star, which is why it is called a star schema. For the above example, we have the following star schema:
Consider the following star schema, with the dimension tables Store, Product and Date and the fact table Sales:

Store(store id, name, address)
Product(prod id, name, category)
Date(time id, day, month, year)
Sales(time id, store id, prod id, units sold)
Facts in the cube are the actual values of the sales relation. Each dimension can have a set of associated attributes:
store(store id, name, city, state, country)
date(time id, date, week, month, quarter, year, holiday flag)

Each such dimension can be structured as a hierarchy.
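To make the hierarchy idea concrete, the sketch below totals units sold at each level of the store dimension, assuming the levels run city, then state, then country. It uses Python's sqlite3 module; the underscore column names, the helper units_sold_by_level and all rows are hypothetical.

```python
import sqlite3

# Levels of the store hierarchy, from finest to coarsest (assumed ordering).
STORE_LEVELS = ("city", "state", "country")


def units_sold_by_level(conn, level):
    """Total units sold, grouped by one level of the store hierarchy."""
    if level not in STORE_LEVELS:
        raise ValueError(f"unknown hierarchy level: {level}")
    # The column name comes from the fixed tuple above, so it is safe to
    # interpolate into the SQL text. Grouping by name alone would merge
    # same-named cities in different states; the sample data avoids that.
    sql = (
        f"SELECT st.{level}, SUM(sa.units_sold) "
        "FROM sales AS sa JOIN store AS st ON st.store_id = sa.store_id "
        f"GROUP BY st.{level} ORDER BY st.{level}"
    )
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE store (store_id INTEGER PRIMARY KEY, name TEXT,
                        city TEXT, state TEXT, country TEXT);
    CREATE TABLE sales (time_id INTEGER, store_id INTEGER,
                        prod_id INTEGER, units_sold INTEGER);
    -- Made-up rows, for illustration only.
    INSERT INTO store VALUES
        (1, 'S1', 'Mumbai', 'Maharashtra', 'India'),
        (2, 'S2', 'Pune', 'Maharashtra', 'India'),
        (3, 'S3', 'Chennai', 'Tamil Nadu', 'India');
    INSERT INTO sales VALUES (1, 1, 1, 10), (1, 2, 1, 4), (1, 3, 2, 6), (2, 1, 2, 5);
    """
)

for level in STORE_LEVELS:
    print(level, units_sold_by_level(conn, level))
```

Moving from city to state to country is a roll-up: the same facts are summed into progressively fewer, coarser groups.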
Let us consider another example, of a schema given by
This schema can be viewed with the instances as follows:
From the schema we have the following:
1. The relation that relates the dimensions to the measure of interest is called the fact table (e.g. sale).
2. Information about dimensions can be represented as a collection of relations, called the dimension tables (product, customer, store).
3. Each dimension can have a set of associated attributes.

Thus we can have the following star schema, in which we have added one more component called time, as the sales of the store depend upon time.
For each dimension, the set of associated attributes can be structured as a hierarchy. For example, we have the following hierarchy structure for store and customer
Let us see how to add the values in the cube. Consider again the sales of products, which can be represented in one dimension (as a fact relation) or in two dimensions, as clients and products. See the following tables:
In three dimensions we have to take the following fact table with corresponding cube with values
8. Aggregation

Many OLAP queries involve aggregation of the data in the fact table. Aggregate operators can be sum, count, max, min, median, average, etc.
For example, to find the total amount for each day, we might use the following SQL statement:
SELECT Date, sum(Amt)
FROM SALE
GROUP BY Date

The output is as follows:
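To see the query run end to end, here is a minimal sketch using Python's sqlite3 module. The column layout SALE(Client, Product, Date, Amt) is inferred from the queries in this section, and the rows are made up for illustration; they are not the data from the original figure. An ORDER BY is added only so that the printed order is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SALE (Client TEXT, Product TEXT, Date TEXT, Amt REAL)")
conn.executemany(
    "INSERT INTO SALE VALUES (?, ?, ?, ?)",
    [
        # Made-up rows: (client, product, date, amount).
        ("c1", "p1", "d1", 10.0),
        ("c2", "p1", "d1", 20.0),
        ("c1", "p2", "d2", 5.0),
        ("c3", "p1", "d2", 15.0),
    ],
)

# Same query as in the text: one output row per distinct date.
daily_totals = conn.execute(
    "SELECT Date, sum(Amt) FROM SALE GROUP BY Date ORDER BY Date"
).fetchall()
for date, total in daily_totals:
    print(date, total)
```

Grouping by Date collapses the client and product dimensions, leaving a one-dimensional view of the data.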
Let us consider another query. To find the total amount of each product for each client, we might use the following SQL statement:

SELECT client, product, sum(amt)
FROM SALE
GROUP BY client, product

The aggregation is over the entire time dimension and thus produces a two-dimensional view of the data, as follows. In a multidimensional data model, we usually store summarizing information, i.e. aggregate values, together with the measure values. This is shown in the following figure:
Computing the sum aggregate operation becomes very easy. Let us write the above values in a cube and then find the sums row-wise and column-wise, as shown in the following figure:
These values can be inserted into a cube as shown below:
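The two-dimensional view with its row and column sums can be sketched in plain Python as below. The rows are made up, and the marker "ALL" for a summed-out dimension is a naming choice for this example rather than something the notes define.

```python
from collections import defaultdict

# Made-up rows of SALE(client, product, date, amt).
sale_rows = [
    ("c1", "p1", "d1", 10.0),
    ("c2", "p1", "d1", 20.0),
    ("c1", "p2", "d2", 5.0),
    ("c3", "p1", "d2", 15.0),
    ("c1", "p1", "d2", 2.0),
]

# Step 1: aggregate over the whole time dimension
# (the effect of GROUP BY client, product).
view = defaultdict(float)
for client, product, _date, amt in sale_rows:
    view[(client, product)] += amt

# Step 2: add the margins. ALL marks a dimension that has been summed out.
ALL = "ALL"
with_totals = dict(view)
for (client, product), amt in view.items():
    with_totals[(client, ALL)] = with_totals.get((client, ALL), 0.0) + amt
    with_totals[(ALL, product)] = with_totals.get((ALL, product), 0.0) + amt
    with_totals[(ALL, ALL)] = with_totals.get((ALL, ALL), 0.0) + amt

# Print the view as a small table, with totals in the last row and column.
clients = sorted({c for c, _ in view}) + [ALL]
products = sorted({p for _, p in view}) + [ALL]
print("client".ljust(8) + "".join(p.rjust(8) for p in products))
for c in clients:
    cells = "".join(f"{with_totals.get((c, p), 0.0):8.1f}" for p in products)
    print(c.ljust(8) + cells)
```

Storing the margins next to the cell values means a total per client or per product is a single lookup instead of a scan over the fact rows.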
One can find the average sales by region within store, or by month within date, using the dimension hierarchy. Thus one can answer “how many products are sold in each region?” For instance, suppose customer c1 belongs to region1 and customers c2 and c3 belong to region2; then we get the following:
One can also find the aggregate sales by city using the dimension hierarchy, as follows:
Thus, finally we can incorporate all the entries in the cube as shown in the following figure
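The roll-up from customers to regions described in this section can be sketched as follows. The customer-to-region mapping is the one stated in the text; the amounts and the function name roll_up are hypothetical and are not the values from the original figures.

```python
from collections import defaultdict

# Dimension hierarchy from the text: c1 is in region1; c2 and c3 are in region2.
CLIENT_REGION = {"c1": "region1", "c2": "region2", "c3": "region2"}

# Made-up per-client, per-product totals.
client_product_totals = {
    ("c1", "p1"): 12.0,
    ("c1", "p2"): 5.0,
    ("c2", "p1"): 20.0,
    ("c3", "p1"): 15.0,
    ("c3", "p2"): 3.0,
}


def roll_up(totals, hierarchy):
    """Replace each client by its parent in the hierarchy and re-aggregate."""
    rolled = defaultdict(float)
    for (client, product), amt in totals.items():
        rolled[(hierarchy[client], product)] += amt
    return dict(rolled)


region_product_totals = roll_up(client_product_totals, CLIENT_REGION)
for key in sorted(region_product_totals):
    print(key, region_product_totals[key])
```

The same function would roll stores up to cities or dates up to months, given the corresponding child-to-parent mapping.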
9. Implementing techniques in OLAP

OLAP applications are characterized by a very large amount of data that is relatively static, with infrequent updates. Thus, various aggregations can be pre-computed and stored in the database. The following index structures are used to improve efficiency:
o bit map indexes
o join indexes
A bit map index is built on a particular column. The index consists of a number of bit vectors called bitmaps: each value in the indexed column has its own bit vector. The length of the bit vector is the number of records in the base table. The i-th bit is set if the i-th row of the base table has that value in the indexed column.

Traditional indexes map a value to a list of record ids. Join indexes map the tuples in the join result of two relations to the source tables. In a star schema, join indexes relate the values of the dimensions to rows in the fact table. For a warehouse with a Sales fact table and a dimension city, a join index on city maintains, for each distinct city, a list of RIDs of the tuples recording the sales in that city. Join indexes can span multiple dimensions.

10. Performance improving techniques

To get an accurate as well as fast answer from the database, queries must be able to run efficiently. This can be achieved by making the following changes:
o Dimension tables can be normalized
o Multiple fact tables can be kept at different aggregation levels
o Fact tables can be denormalized
o Tables can be partitioned and replicated
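The bit map and join indexes described in section 9 can be sketched in a few lines of Python. This is an in-memory illustration under simplifying assumptions: each bit vector is held in a Python integer, a RID is simply a row's position in a list, and all table contents and function names are made up.

```python
from collections import defaultdict

# Made-up base table; the row id (RID) is the position in the list.
sales = [
    ("Mumbai", 10.0),
    ("Pune", 4.0),
    ("Mumbai", 7.0),
    ("Chennai", 6.0),
    ("Pune", 1.0),
]


def build_bitmap_index(rows, column):
    """One bit vector per distinct value; bit i is set when row i has that value."""
    index = defaultdict(int)
    for i, row in enumerate(rows):
        index[row[column]] |= 1 << i
    return dict(index)


def rows_from_bitmap(bitmap):
    """Decode a bit vector (stored as an int) back into row positions."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


city_bitmaps = build_bitmap_index(sales, column=0)
# Predicates combine with cheap bitwise operations: city is Mumbai OR Pune.
mumbai_or_pune = city_bitmaps["Mumbai"] | city_bitmaps["Pune"]
print(rows_from_bitmap(mumbai_or_pune))

# Star-schema flavour: a dimension table and a fact table that references it.
city_dim = {1: "Mumbai", 2: "Pune", 3: "Chennai"}   # city_id -> city name
sales_fact = [(1, 10.0), (2, 4.0), (1, 7.0), (3, 6.0), (2, 1.0)]  # (city_id, amount)


def build_join_index(dimension, fact):
    """For each dimension value, list the RIDs of the matching fact rows."""
    index = {name: [] for name in dimension.values()}
    for rid, (city_id, _amount) in enumerate(fact):
        index[dimension[city_id]].append(rid)
    return index


join_index = build_join_index(city_dim, sales_fact)
# Total sales for one city without scanning or joining the whole fact table.
mumbai_total = sum(sales_fact[rid][1] for rid in join_index["Mumbai"])
print(join_index, mumbai_total)
```

Bit map indexes suit columns with few distinct values, since one vector is kept per value; the join index trades extra storage for skipping the join at query time.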