
Designing High Performance Cubes In Analysis Services 2008

Esendal Yasin Lead Architect Microsoft MEA HQ

Dimension Design Cube Design Partitioning Aggregations Scale-out

Attribute Relationships
1:M relationships between attributes
Server simply works better
Examples
City → State, State → Country
Day → Month, Month → Quarter, Quarter → Year
Product Subcategory → Product Category

Rigid versus flexible relationships
Flexible (default): Customer → City, Customer → PhoneNo
Rigid: Customer → BirthDate, City → State

All attributes implicitly related to key attribute


Attribute Relationships
Where are they used?
Query performance: greatly improved effectiveness of in-memory caching; materialized hierarchies when present
Processing performance: fewer, smaller hash tables result in faster, less memory-intensive processing
Aggregation design: the algorithm needs relationships in order to design effective aggregations
Member properties: attribute relationships identify member properties on levels

Natural Hierarchies
1:M relation (via attribute relationships) between every pair of adjacent levels
Examples
Country-State-City-Customer (natural)
Country-City (natural)
Age-Gender-Customer (unnatural)
Year-Quarter-Month (depends on key columns)
How many quarters and months? 4 and 12 across all years (unnatural); 4 and 12 for each year (natural)
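The Year-Quarter-Month case can be checked mechanically. The following is a minimal sketch, assuming hypothetical date rows and a hypothetical helper `is_natural`; it only tests whether every child key maps to exactly one parent key, which is the 1:M condition described above.

```python
from collections import defaultdict

def is_natural(rows, child, parent):
    """Return True when every child key maps to exactly one parent key (1:M)."""
    parents = defaultdict(set)
    for row in rows:
        parents[row[child]].add(row[parent])
    return all(len(p) == 1 for p in parents.values())

# Hypothetical date rows. The quarter key is either the bare quarter number
# (shared across all years) or a composite (year, quarter) key.
dates = [
    {"year": 2007, "quarter": 1, "year_quarter": (2007, 1)},
    {"year": 2007, "quarter": 2, "year_quarter": (2007, 2)},
    {"year": 2008, "quarter": 1, "year_quarter": (2008, 1)},
    {"year": 2008, "quarter": 2, "year_quarter": (2008, 2)},
]

# Quarter 1 belongs to both 2007 and 2008, so Quarter -> Year is not 1:M.
quarter_only_is_natural = is_natural(dates, "quarter", "year")
# With a composite key, each quarter belongs to exactly one year.
composite_key_is_natural = is_natural(dates, "year_quarter", "year")
```

With the bare quarter number as key the level pair fails the check; with the composite key it passes, which is why the result "depends on key columns".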

Natural Hierarchies
Performance implications
Only natural hierarchies are materialized on disk during processing
Unnatural hierarchies are built on the fly during queries (and cached in memory)
Server internally decomposes unnatural hierarchies into natural components
Essentially operates like an ad hoc navigation path (but somewhat better)

Aggregation designer favors user-defined hierarchies


Large Dimensions
Optimizing Processing
Use natural hierarchies
Good attribute/hierarchy relationships force the AS engine to build smaller DISTINCT queries instead of one large and expensive query
Consider size of other properties/attributes

Dimension SQL queries are in the form of: SELECT DISTINCT Key1, Key2, Name, ..., RelKey1, RelKey2, ... FROM [DimensionTable]
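To make the query shape concrete, here is a minimal sketch that runs queries of that form against an in-memory SQLite table. The table `DimCustomer`, its columns, and the use of SQLite as a stand-in for the relational source are all assumptions for illustration; this is not the exact SQL that Analysis Services emits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DimCustomer (CustomerKey INTEGER, City TEXT, State TEXT)")
conn.executemany(
    "INSERT INTO DimCustomer VALUES (?, ?, ?)",
    [(1, "Seattle", "WA"), (2, "Seattle", "WA"), (3, "Portland", "OR")],
)

# One DISTINCT query per attribute: the attribute's key plus the key of its
# related attribute (City -> State). Duplicate fact-grain rows collapse.
city_rows = conn.execute(
    "SELECT DISTINCT City, State FROM DimCustomer ORDER BY City"
).fetchall()

# State has no related attribute in this toy table, so its query is smaller still.
state_rows = conn.execute(
    "SELECT DISTINCT State FROM DimCustomer ORDER BY State"
).fetchall()
```

Each query returns only the distinct members of one attribute, so an index covering the selected columns on the underlying table is what these statements benefit from.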

Important to tune your SQL statements

Add indexes to underlying tables
Create a separate table for dimensions
Avoid OPENROWSET queries
Use views to create your own version of query binding

Size limitations for string stores and effect on dimension size

4 GB limit, stored in Unicode, 6 bytes of overhead per string
E.g. 50-character name: 4*1024*1024*1024 / (6+50*2) = 40.5 million members
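The same arithmetic as a small sketch, so the limit can be re-estimated for other name lengths. The constant and function names are illustrative; the figures (4 GB store, 6 bytes of overhead, 2 bytes per Unicode character) come from the slide.

```python
STRING_STORE_LIMIT_BYTES = 4 * 1024 ** 3  # 4 GB string store limit
PER_STRING_OVERHEAD_BYTES = 6             # fixed overhead per stored string
BYTES_PER_CHAR = 2                        # Unicode storage, 2 bytes per character

def max_members(avg_name_length):
    """Estimate how many members fit in the string store for a given name length."""
    bytes_per_member = PER_STRING_OVERHEAD_BYTES + avg_name_length * BYTES_PER_CHAR
    return STRING_STORE_LIMIT_BYTES // bytes_per_member

members_for_50_chars = max_members(50)  # roughly 40.5 million
```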

Dimension Processing
ByAttribute vs. ByTable
ProcessingGroup property on each Dimension
By default it is set to ByAttribute This is usually the best choice

What is ByTable?
Sends one SQL query per dimension table
Can result in better performance in limited cases
Memory requirements and resource cost much higher for AS!

Example: two dimensions, each with >25M members
ByTable: took 80% of available memory (25.6GB out of 32GB)
ByAttribute: only took 9GB out of 32GB



Cube Dimensions
Consolidate multiple hierarchies into a single dimension (unless they are related via the fact table)
Use role-playing dimensions (e.g., OrderDate, BillDate, ShipDate): avoids multiple physical copies
Use parent-child dimensions prudently: no aggregation support

Set Materialized = true on reference dimensions
Use many-to-many dimensions prudently
Slower than regular dimensions, but faster than calculations
Intermediate measure group must be small relative to primary measure group
Consider creating aggregations on the shared common attributes of the intermediate measure group

Measure Groups
Common questions
At what point do you split from a single cube and create one or more additional cubes? How many is too many?

Why is this important?

New measure groups adding new dimensions result in an expansion of the cube space
Larger calculation space = more work for the engine when evaluating calculations

Look at the increase in dimensionality. If it is significant, and overlap with other measure groups is minimal, consider a separate cube
Will users want to analyze measures together?
Will calculations need to reference a unified measures collection?

AMO Design Warnings

~60 best practice rules integrated via real-time designer checks
Think of it as auto BPA while you develop
Subtle: blue squiggly lines and build-time warnings; no pop-ups to get in your way

Warnings can be dismissed by instance or globally; a comment can be specified in each case




Partitioning: mechanism to break up large cubes into manageable chunks
Measure groups can be partitioned, but not dimensions
A measure group can have one or more partitions
Fact rows are distributed across partitions as per the partitioning scheme
Partitioning scheme is managed by the DBA (server is unaware)
Examples
By Time: Sales for 2001, 2002, 2003, ...
By Geography: Sales for North America, Europe, Asia, ...

Benefits Of Partitioning
Partitions can be added, processed, deleted independently
Update to last month's data does not affect prior months' partitions
Sliding window scenario is easy to implement, e.g., for a 24-month window, add the June 2006 partition and delete June 2004
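The sliding-window bookkeeping can be sketched as follows. This is a minimal illustration with a hypothetical helper `months_to_drop`; it only computes which monthly partitions fall outside the window once a new month is added, and says nothing about how the partitions are actually created or deleted on the server.

```python
def months_to_drop(existing, new_month, window=24):
    """Return the (year, month) partitions that fall outside the window
    once new_month has been added."""
    year, month = new_month
    newest = year * 12 + (month - 1)
    oldest_kept = newest - (window - 1)
    return sorted(m for m in existing if m[0] * 12 + (m[1] - 1) < oldest_kept)

# 24 existing monthly partitions: June 2004 through May 2006.
start = 2004 * 12 + 5  # month index of June 2004
existing = [((start + i) // 12, (start + i) % 12 + 1) for i in range(24)]

# Adding June 2006 pushes June 2004 out of the 24-month window.
dropped = months_to_drop(existing, (2006, 6))
```

Only the new partition is processed and only the expired one is removed; every other month's partition is left untouched.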

Partitions can have different storage settings

Storage mode (MOLAP, ROLAP, HOLAP)
Aggregation design
Alternate disk drive
Remote server

Benefits Of Partitioning
Partitions can be processed and queried in parallel
Better utilization of server resources
Reduced data warehouse load times

Queries are isolated to relevant partitions: less data to scan
SELECT ... FROM ... WHERE [Time].[Year].[2006] queries only 2006 partitions

Bottom line: partitions enable manageability, performance, and scalability

Best Practices For Partitioning

General guidance: 20M rows per partition
Use judgment, e.g., it may be better to have 500 partitions of 40 million rows each than 1,000 partitions of 20 million rows each
Standard tools are unable to manage thousands of partitions
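A quick way to weigh the trade-off is to compute the partition count for a given fact table size. The sketch below assumes a hypothetical fact table of 20 billion rows, which is the size implied by both options in the example above.

```python
def partition_count(fact_rows, rows_per_partition=20_000_000):
    """Number of partitions needed, rounding up (integer ceiling division)."""
    return (fact_rows + rows_per_partition - 1) // rows_per_partition

fact_rows = 20_000_000_000  # hypothetical fact table size

at_20m = partition_count(fact_rows)              # guidance figure: 20M rows each
at_40m = partition_count(fact_rows, 40_000_000)  # larger partitions, fewer of them
```

Doubling the rows per partition halves the number of partitions (and files) to manage, which is the judgment call the guidance leaves open.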

More partitions means more files

E.g. one 10GB cube with ~250,000 files (design issues)
Deletion of database took ~25min to complete

Partition by time plus another dimension e.g. Geography

Limits amount of reprocessing
Use query patterns to pick another partitioning attribute

When data changes

All data cache for the measure group is discarded
Separate cube or measure groups by static and real-time analysis


Best Practices For Partitioning

Equal sized partitions
Figure (labeled January 2008): equal sized partitions cover 11-20, 21-30, 31-40, 41-50; not equal sized partitions cover 11-15, 16-20, 21-25, 26-50


Partition Slices
Defining partition slices allows server to avoid scanning partitions
E.g. Partition { [January 2008] } not scanned for query to [February 2008]

Simple partition slices are automatically detected for MOLAP partitions

You have to explicitly specify slices for ROLAP partitions
Non-trivial slices (e.g. a partition with data for more than one month) will not be automatically set; you can still set them explicitly

What if I get the error "The slice specified for attribute_name attribute is incorrect"?
MOLAP processing validates partition data against the defined slice
Error raised if data received from source does not match slice specified
E.g. Slice = [January] but partition query incorrectly defined to be SELECT ... FROM facts WHERE Month = February
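Both uses of a slice, skipping partitions at query time and validating data at processing time, can be illustrated with a small model. This is a sketch under stated assumptions: the partition names, the dictionary layout, and the rule that a partition without a slice is always scanned are illustrative choices, not the server's internal logic.

```python
def partitions_to_scan(partitions, queried_month):
    """Skip partitions whose slice excludes the queried month.
    Assumption: a partition with no slice (None) must always be scanned."""
    return [
        name for name, slice_months in partitions.items()
        if slice_months is None or queried_month in slice_months
    ]

def validate_slice(attribute, slice_members, source_rows):
    """Mimic the processing-time check: every row must fall inside the slice."""
    if any(row[attribute] not in slice_members for row in source_rows):
        raise ValueError(
            f"The slice specified for {attribute} attribute is incorrect"
        )

partitions = {
    "Sales_2008_01": {"January 2008"},
    "Sales_2008_02": {"February 2008"},
    "Sales_NoSlice": None,  # hypothetical partition without a slice
}

# The January partition is not scanned for a February query.
scanned = partitions_to_scan(partitions, "February 2008")
```

Calling `validate_slice("Month", {"January"}, [{"Month": "February"}])` raises the error, mirroring the mismatched partition query in the example above.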

Dimension Design Cube Design Partitioning Aggregations Scale-out


What Is An Aggregation?
A subtotal of partition data based on a set of attributes from each dimension
Customer levels: All Customers, Country, State, City, Name
Product levels: All Products, Category, Brand, Item, SKU

Highest-Level Aggregation
Customer  Product  Units Sold  Sales
All       All      347814123   $345,212,301.30

Intermediate Aggregation
countryCode  productID  Units Sold  Sales
Can          sd452      9456        $23,914.30
US           yu678      4623        $57,931.45

custID  SKU     Units Sold  Sales
345-23  135123  2           $45.67
563-01  451236  34          $67.32
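An aggregation is simply a pre-computed subtotal, as in this minimal sketch. The fact rows are hypothetical and are not the figures from the tables above; the grouping by (country, product) corresponds to the intermediate aggregation, and the grand total to the highest-level one.

```python
from collections import defaultdict

# Hypothetical fact rows: (country, product_id, units_sold, sales)
facts = [
    ("Can", "sd452", 2, 45.50),
    ("Can", "sd452", 3, 60.25),
    ("US", "yu678", 34, 67.25),
]

def aggregate(rows):
    """Subtotal units and sales for each (country, product_id) pair."""
    totals = defaultdict(lambda: [0, 0.0])
    for country, product_id, units, sales in rows:
        totals[(country, product_id)][0] += units
        totals[(country, product_id)][1] += sales
    return {key: tuple(value) for key, value in totals.items()}

# Intermediate aggregation: one row per (country, product).
by_country_product = aggregate(facts)

# The (All, All) total can be derived from the intermediate aggregation
# without touching the fact rows again.
grand_total_units = sum(units for units, _ in by_country_product.values())
```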


Aggregations Help Query Performance

Customers: All Customers, Country, State, City, Name
Products: All Products, Category, Brand, Item, SKU
Time: All Time, Year, Quarter, Month, Day

Query levels               Aggregation used           Max cells
(All, All, All)            (All, All, All)            1
(Country, Item, Quarter)   (Country, Item, Quarter)   274,356
(Country, Brand, Quarter)  (Country, Item, Quarter)   274,356
(Country, Category, All)   (Country, Item, Quarter)   274,356
(State, Item, Quarter)     (Name, SKU, Day)           34,264,872,495
(City, Category, Year)     (Name, SKU, Day)           34,264,872,495

Using a higher-level aggregation means fewer cells to consider
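The mapping from query levels to the aggregation used can be reproduced with a simple rule: an aggregation can answer a query only if it is at least as detailed as the query in every dimension, and among the candidates the one with the fewest cells wins. This is a simplified sketch of that rule, not the engine's actual selection algorithm; the level lists and cell counts are taken from the slide, and the function names are illustrative.

```python
# Levels per dimension, ordered from most summarized (index 0) to most detailed.
LEVELS = {
    "customer": ["All", "Country", "State", "City", "Name"],
    "product": ["All", "Category", "Brand", "Item", "SKU"],
    "time": ["All", "Year", "Quarter", "Month", "Day"],
}
DIMS = ["customer", "product", "time"]

# Available aggregations and their maximum cell counts.
AGGREGATIONS = {
    ("All", "All", "All"): 1,
    ("Country", "Item", "Quarter"): 274_356,
    ("Name", "SKU", "Day"): 34_264_872_495,
}

def can_answer(agg, query):
    """True if the aggregation is at least as detailed as the query everywhere."""
    return all(
        LEVELS[d].index(a) >= LEVELS[d].index(q)
        for d, a, q in zip(DIMS, agg, query)
    )

def pick_aggregation(query):
    """Choose the usable aggregation with the fewest cells."""
    candidates = [agg for agg in AGGREGATIONS if can_answer(agg, query)]
    return min(candidates, key=AGGREGATIONS.get)
```

For example, `pick_aggregation(("State", "Item", "Quarter"))` falls back to `("Name", "SKU", "Day")` because State is more detailed than Country, so the (Country, Item, Quarter) aggregation cannot be rolled up to answer it.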

Best Practices For Aggregations

Define all possible attribute relationships
Set accurate attribute member counts and fact table counts
Set AggregationUsage to guide the aggregation designer
Set rarely queried attributes to None
Set commonly queried attributes to Unrestricted

Do not build too many aggregations: in the 100s, not 1000s!

Do not build aggregations larger than 30% of fact table size (the aggregation design algorithm doesn't)


Best Practices For Aggregations

Aggregation design cycle
Use Storage Design Wizard (~20% perf gain) to design initial set of aggregations
Enable query log and run pilot workload (beta test with limited set of users)
Use Usage Based Optimization (UBO) Wizard to refine aggregations
Periodically use UBO to refine aggregations




Scalable Shared Databases

Ask/Need: an easy way of scaling out AS data across multiple machines.
Today's problem: while MOLAP cubes are read-only databases, no two servers can share the same data directory. Cube Sync works but has latency issues that are not acceptable in load-balanced solutions.

AS 2008 solution: a single read-only copy of the database is shared between several Analysis Servers.

Diagram: a Virtual IP in front of several Analysis Servers, all attached to the same SAN storage


Scalable Shared Databases In Practice


Load balancer: NLB, F5, custom ASP.NET
Processing server: cube processing, then detach and attach
Query servers: read-only DB on shared SAN drive

Enabling The SSD Scenario

Ability to store database outside server data folder: SAN drive, NAS network share, flash/SSD

Ability to attach/detach a database
Attach from any location
Attach as read-only or read-write
Multiple instances can attach as read-only (shared)
Only one instance can attach as read-write (exclusive)
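The shared versus exclusive attach rules behave like a reader/writer lock. The class below is a hypothetical bookkeeping model, not an Analysis Services API; it also assumes that "exclusive" means a read-write attach is refused while any read-only attach exists, which the slide implies but does not state outright.

```python
class SharedDatabase:
    """Hypothetical model of attach bookkeeping; not an Analysis Services API."""

    def __init__(self):
        self.readers = set()  # instances attached read-only
        self.writer = None    # the single instance attached read-write, if any

    def attach(self, instance, mode):
        if mode == "read_only":
            if self.writer is not None:
                raise RuntimeError("database is attached read-write elsewhere")
            self.readers.add(instance)
        elif mode == "read_write":
            # Assumption: read-write requires that nobody else is attached.
            if self.writer is not None or self.readers:
                raise RuntimeError("read-write attach requires exclusive access")
            self.writer = instance
        else:
            raise ValueError(f"unknown mode: {mode}")

    def detach(self, instance):
        self.readers.discard(instance)
        if self.writer == instance:
            self.writer = None
```

In the scale-out layout, the processing server would hold the read-write attach while it processes, detach, and the query servers would then attach the same folder read-only.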

Read Only Enforcement

Disallow all update operations (processing, writeback, restore, etc.)
Disable lazy processing, proactive caching
Allow loading database from read-only media