
Designing High Performance Cubes In Analysis Services 2008

Esendal Yasin Lead Architect Microsoft MEA HQ

Agenda
Dimension Design
Cube Design
Partitioning
Aggregations
Scale-out

Attribute Relationships
1:M relationships between attributes
Server simply works better
Examples
  City → State, State → Country
  Day → Month, Month → Quarter, Quarter → Year
  Product Subcategory → Product Category

Rigid versus flexible relationships
  Flexible (default)
    Customer → City, Customer → PhoneNo
  Rigid
    Customer → BirthDate, City → State

All attributes implicitly related to key attribute



Attribute Relationships
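Illustration (diagram): Customer is the key attribute. City → State → Country form a chain of relationships; Gender, Marital, and Age relate directly to Customer.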

Attribute Relationships
Where are they used?
Storage
Query performance
  Greatly improved effectiveness of in-memory caching
  Materialized hierarchies when present
Processing performance: fewer, smaller hash tables result in faster, less memory-intensive processing
Aggregation design: the algorithm needs relationships in order to design effective aggregations
Member properties: attribute relationships identify member properties on levels

Natural Hierarchies
1:M relation (via attribute relationships) between every pair of adjacent levels
Examples
  Country-State-City-Customer (natural)
  Country-City (natural)
  Age-Gender-Customer (unnatural)
  Year-Quarter-Month (depends on key columns)
    How many quarters and months?
    4 & 12 across all years (unnatural)
    4 & 12 for each year (natural)
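
The Year-Quarter-Month case can be checked mechanically: a pair of adjacent levels is natural when every child key maps to exactly one parent key. The sketch below is illustrative only; the function name `is_one_to_many` and the sample rows are assumptions made for this example, not part of Analysis Services.

```python
from collections import defaultdict

def is_one_to_many(pairs):
    """Return True if every child key maps to exactly one parent key."""
    parents = defaultdict(set)
    for child, parent in pairs:
        parents[child].add(parent)
    return all(len(p) == 1 for p in parents.values())

# Quarter keyed by quarter name only: "Q1" belongs to every year (unnatural).
quarter_only = [("Q1", 2007), ("Q2", 2007), ("Q1", 2008), ("Q2", 2008)]

# Quarter keyed by (year, quarter): each quarter belongs to one year (natural).
year_quarter = [((2007, "Q1"), 2007), ((2007, "Q2"), 2007),
                ((2008, "Q1"), 2008), ((2008, "Q2"), 2008)]

print(is_one_to_many(quarter_only))   # False
print(is_one_to_many(year_quarter))   # True
```

Running a check like this against the dimension table before declaring an attribute relationship helps confirm that the chosen key columns really produce a 1:M relation.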

Natural Hierarchies
Performance implications
Only natural hierarchies are materialized on disk during processing
Unnatural hierarchies are built on the fly during queries (and cached in memory)
Server internally decomposes unnatural hierarchies into natural components
  Essentially operates like an ad hoc navigation path (but somewhat better)

Aggregation designer favors user-defined hierarchies



Large Dimensions
Optimizing Processing
Use natural hierarchies
  Good attribute/hierarchy relationships force the AS engine to build smaller DISTINCT queries instead of one large and expensive query
  Consider size of other properties/attributes

Dimension SQL queries are of the form
  select distinct Key1, Key2, Name, ..., RelKey1, RelKey2, ... from [DimensionTable]

Important to tune your SQL statements
  Add indexes to underlying tables
  Create a separate table for dimensions
  Avoid OPENROWSET queries
  Use views to create your own version of query binding

Size limitations for string stores and effect on dimension size
  4 GB, stored in Unicode, 6-byte per-string overhead
  E.g. 50-character name: 4*1024*1024*1024 / (6+50*2) = 40.5 million members
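
The string store arithmetic can be repeated for other name lengths. This is a minimal sketch that uses only the figures quoted on the slide (4 GB limit, 6 bytes of overhead per string, 2 bytes per Unicode character); the function name `max_members` is an assumption for illustration.

```python
STRING_STORE_LIMIT_BYTES = 4 * 1024**3   # 4 GB limit quoted on the slide
PER_STRING_OVERHEAD_BYTES = 6            # per-string overhead quoted on the slide
BYTES_PER_CHAR = 2                       # Unicode storage, 2 bytes per character

def max_members(avg_name_chars):
    """Estimate how many member names of the given length fit in one string store."""
    bytes_per_member = PER_STRING_OVERHEAD_BYTES + avg_name_chars * BYTES_PER_CHAR
    return STRING_STORE_LIMIT_BYTES // bytes_per_member

print(max_members(50))   # 40518559, i.e. about 40.5 million members
```

Longer names reduce the ceiling proportionally, so the average string length of a large attribute is worth checking before it is added to the dimension.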

Dimension Processing
ByAttribute vs. ByTable
ProcessingGroup property on each dimension
  By default it is set to ByAttribute
  This is usually the best choice

What is ByTable?
  Sends one SQL query per dimension table
  Can result in better performance in limited cases
  Memory requirements and resource cost are much higher for AS!

Example
  Two dimensions, each with >25M members
  ByTable
    Took 80% of available memory (25.6 GB out of 32 GB)
  ByAttribute
    Only took 9 GB out of 32 GB


Cube Dimensions
Dimensions
  Consolidate multiple hierarchies into a single dimension (unless they are related via the fact table)
  Use role-playing dimensions (e.g., OrderDate, BillDate, ShipDate): avoids multiple physical copies
  Use parent-child dimensions prudently
    No aggregation support
  Set Materialized = true on reference dimensions
  Use many-to-many dimensions prudently
    Slower than regular dimensions, but faster than calculations
    Intermediate measure group must be small relative to primary measure group
    Consider creating aggregations on the shared common attributes of the intermediate measure group

Measure Groups
Common questions
  At what point do you split from a single cube and create one or more additional cubes?
  How many is too many?

Why is this important?
  New measure groups adding new dimensions result in an expansion of the cube space
  Larger calculation space = more work for the engine when evaluating calculations

Guidance
  Look at the increase in dimensionality. If it is significant, and overlap with other measure groups is minimal, consider a separate cube
  Will users want to analyze measures together?
  Will calculations need to reference a unified measures collection?

AMO Design Warnings


~60 best practice rules integrated via real-time designer checks
Think of it as auto BPA while you develop
Subtle
  Blue squiggly lines and build-time warnings
  No pop-ups to get in your way
Dismissible
  By instance or globally
  Can specify a comment in each case


Partitioning
Mechanism to break up large cubes into manageable chunks
Measure groups can be partitioned, but not dimensions
A measure group can have one or more partitions
Fact rows are distributed across partitions as per the partitioning scheme
Partitioning scheme is managed by the DBA (server is unaware)
Examples
  By Time: Sales for 2001, 2002, 2003, ...
  By Geography: Sales for North America, Europe, Asia, ...

Benefits Of Partitioning
Partitions can be added, processed, deleted independently
  An update to last month's data does not affect prior months' partitions
  Sliding window scenario is easy to implement, e.g., for a 24-month window, add the June 2006 partition and delete June 2004

Partitions can have different storage settings
  Storage mode (MOLAP, ROLAP, HOLAP)
  Aggregation design
  Alternate disk drive
  Remote server
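
The sliding window scenario reduces to simple month arithmetic. The sketch below only computes which monthly partition to add and which to delete; the function names are hypothetical, and the actual partition creation and deletion (for example through AMO or XMLA scripts) is deliberately left out.

```python
def shift_month(year, month, delta):
    """Return (year, month) moved by delta months (delta may be negative)."""
    index = year * 12 + (month - 1) + delta
    return index // 12, index % 12 + 1

def slide_window(new_year, new_month, window_months=24):
    """Return the monthly partition to add and the one to delete."""
    to_add = (new_year, new_month)
    to_delete = shift_month(new_year, new_month, -window_months)
    return to_add, to_delete

# 24-month window: adding June 2006 means deleting June 2004.
print(slide_window(2006, 6))   # ((2006, 6), (2004, 6))
```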

Benefits Of Partitioning
Partitions can be processed and queried in parallel
  Better utilization of server resources
  Reduced data warehouse load times

Queries are isolated to relevant partitions: less data to scan
  SELECT ... FROM ... WHERE [Time].[Year].[2006]
  Queries only 2006 partitions

Bottom line: partitions enable
  Manageability
  Performance
  Scalability

Best Practices For Partitioning


General guidance: 20M rows per partition
  Use judgment, e.g., it may be better to have 500 partitions of 40 million rows each than 1000 partitions of 20 million rows each
  Standard tools are unable to manage thousands of partitions

More partitions means more files
  E.g. one 10 GB cube with ~250,000 files (design issues)
  Deletion of the database took ~25 min to complete

Partition by time plus another dimension, e.g. Geography
  Limits amount of reprocessing
  Use query patterns to pick another partitioning attribute

When data changes
  All data cache for the measure group is discarded
  Separate cubes or measure groups by static and real-time analysis
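
The trade-off between partition size and partition count in the general guidance can be estimated up front. This is a rough sketch: the 20-billion-row fact table is a hypothetical figure chosen so that the result matches the 500 versus 1000 example, and `partition_count` is an illustrative name.

```python
def partition_count(fact_rows, rows_per_partition=20_000_000):
    """Number of partitions needed, rounding up (integer arithmetic only)."""
    return (fact_rows + rows_per_partition - 1) // rows_per_partition

fact_rows = 20_000_000_000   # hypothetical fact table size

print(partition_count(fact_rows))               # 1000 partitions of 20M rows
print(partition_count(fact_rows, 40_000_000))   # 500 partitions of 40M rows
```

If the count lands in the thousands, larger partitions may be the more manageable choice, in line with the note about standard tools.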


Best Practices For Partitioning


Equal-sized partitions
Illustration (diagram, January 2008):
  Equal-sized partitions: 11-20, 21-30, 31-40, 41-50
  Unequal partitions: 11-15, 16-20, 21-25, 26-50


Partition Slices
Defining partition slices allows the server to avoid scanning partitions
  E.g. partition { [January 2008] } is not scanned for a query on [February 2008]

Simple partition slices are automatically detected for MOLAP partitions
  You have to explicitly specify slices for ROLAP partitions
  Non-trivial slices (e.g. a partition with data for more than one month) will not be set automatically; you can still set them explicitly

What if I get the error "The slice specified for attribute_name attribute is incorrect"?
  MOLAP processing validates partition data against the defined slice
  Error is raised if data received from the source does not match the slice specified
  E.g. Slice = [January] but the partition query is incorrectly defined as SELECT ... FROM facts WHERE Month = February
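
Both ideas on this slide, skipping partitions by slice and validating data against the slice, can be modelled in a few lines. This is a conceptual sketch only; it does not reproduce how the server implements either step, and the partition names and function names are hypothetical.

```python
def partitions_to_scan(partition_slices, queried_members):
    """Return names of partitions whose slice overlaps the queried members.

    A partition with no slice defined (None) can never be skipped.
    """
    queried = set(queried_members)
    return [name for name, members in partition_slices.items()
            if members is None or queried & set(members)]

def validate_slice(slice_members, source_members):
    """Raise ValueError if the source data contains members outside the slice."""
    unexpected = set(source_members) - set(slice_members)
    if unexpected:
        raise ValueError(f"data outside slice: {sorted(unexpected)}")

slices = {
    "P_Jan2008": {"January 2008"},
    "P_Feb2008": {"February 2008"},
    "P_NoSlice": None,   # no slice defined, so it is always scanned
}

print(partitions_to_scan(slices, ["February 2008"]))   # ['P_Feb2008', 'P_NoSlice']
```

Calling `validate_slice({"January"}, ["February"])` raises an error, which mirrors the mismatch between slice and partition query described above.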

What Is An Aggregation?
A subtotal of partition data based on a set of attributes from each dimension

Dimension levels
  Customers: All Customers, Country, State, City, Name
  Products:  All Products, Category, Brand, Item, SKU

Highest-Level Aggregation
  Customer  Product  Units Sold  Sales
  All       All      347814123   $345,212,301.30

Intermediate Aggregation
  countryCode  productID  Units Sold  Sales
  Can          sd452      9456        $23,914.30
  US           yu678      4623        $57,931.45

Facts
  custID  SKU     Units Sold  Sales
  345-23  135123  2           $45.67
  563-01  451236  34          $67.32
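
An aggregation in this sense is a pre-computed group-by over fact rows. The sketch below shows the idea on a handful of made-up rows; the rows, the column layout, and the `aggregate` helper are assumptions for illustration and are unrelated to the server's storage format.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical fact rows: (country_code, product_id, units_sold, sales)
facts = [
    ("US", "yu678", 2, Decimal("45.67")),
    ("US", "yu678", 34, Decimal("67.32")),
    ("Can", "sd452", 5, Decimal("10.00")),
]

def aggregate(rows, key_columns):
    """Subtotal units and sales by the chosen key columns."""
    totals = defaultdict(lambda: [0, Decimal("0")])
    for row in rows:
        key = tuple(row[i] for i in key_columns)
        totals[key][0] += row[2]
        totals[key][1] += row[3]
    return {key: tuple(values) for key, values in totals.items()}

intermediate = aggregate(facts, key_columns=(0, 1))   # by country and product
highest = aggregate(facts, key_columns=())            # (All, All): one grand total

print(intermediate[("US", "yu678")])   # (36, Decimal('112.99'))
print(highest[()])                     # (41, Decimal('122.99'))
```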


Aggregations Help Query Performance


Dimension levels
  Customers: All Customers, Country, State, City, Name
  Products:  All Products, Category, Brand, Item, SKU
  Time:      All Time, Year, Quarter, Month, Day

Query levels               Aggregation used           Max Cells
(All, All, All)            (All, All, All)            1
(Country, Item, Quarter)   (Country, Item, Quarter)   274,356
(Country, Brand, Quarter)  (Country, Item, Quarter)   274,356
(Country, Category, All)   (Country, Item, Quarter)   274,356
(State, Item, Quarter)     (Name, SKU, Day)           34,264,872,495
(City, Category, Year)     (Name, SKU, Day)           34,264,872,495

Using a higher-level aggregation means fewer cells to consider
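
The choice of aggregation for each query on this slide follows one rule: an aggregation can answer a query only if, in every dimension, its level is at or below the queried level, and among those the smallest one wins. A minimal sketch of that rule, assuming only the three aggregations and cell counts shown on the slide (the function names are illustrative, and the server's real selection logic is more involved):

```python
CUSTOMER = ["All", "Country", "State", "City", "Name"]
PRODUCT = ["All", "Category", "Brand", "Item", "SKU"]
TIME = ["All", "Year", "Quarter", "Month", "Day"]
DIMENSIONS = [CUSTOMER, PRODUCT, TIME]

# Available aggregations and their maximum cell counts (figures from the slide).
AGGREGATIONS = {
    ("All", "All", "All"): 1,
    ("Country", "Item", "Quarter"): 274_356,
    ("Name", "SKU", "Day"): 34_264_872_495,   # fact-level data
}

def depth(levels):
    """Position of each level within its dimension (0 = All, larger = more detailed)."""
    return tuple(dim.index(level) for dim, level in zip(DIMENSIONS, levels))

def covers(aggregation, query):
    """An aggregation covers a query if it is at least as detailed in every dimension."""
    return all(a >= q for a, q in zip(depth(aggregation), depth(query)))

def pick_aggregation(query):
    """Return the covering aggregation with the fewest cells."""
    candidates = [agg for agg in AGGREGATIONS if covers(agg, query)]
    return min(candidates, key=AGGREGATIONS.get)

print(pick_aggregation(("Country", "Category", "All")))   # ('Country', 'Item', 'Quarter')
print(pick_aggregation(("State", "Item", "Quarter")))     # ('Name', 'SKU', 'Day')
```

The second query falls through to fact-level data because State is more detailed than Country, so the intermediate aggregation cannot answer it.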

Best Practices For Aggregations


Define all possible attribute relationships
Set accurate attribute member counts and fact table counts
Set AggregationUsage to guide the aggregation designer
  Set rarely queried attributes to None
  Set commonly queried attributes to Unrestricted

Do not build too many aggregations
  In the 100s, not 1000s!

Do not build aggregations larger than 30% of fact table size (the aggregation design algorithm doesn't)
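
The 30% rule shows why accurate member counts and fact table counts matter. The sketch below uses a deliberately simple size estimate (the product of attribute member counts, capped at the fact row count), which is an assumption for illustration and not the aggregation designer's actual algorithm; all counts are hypothetical.

```python
from math import prod

def estimated_rows(member_counts, fact_rows):
    """Upper bound on aggregation rows: product of member counts, capped at fact rows."""
    return min(prod(member_counts), fact_rows)

def within_limit(member_counts, fact_rows, limit_percent=30):
    """True if the estimated aggregation is no larger than limit_percent of the fact table."""
    return estimated_rows(member_counts, fact_rows) * 100 <= fact_rows * limit_percent

fact_rows = 100_000_000   # hypothetical fact table count

print(within_limit([50, 20, 5], fact_rows))          # True: at most 5,000 rows
print(within_limit([20_000, 5_000, 60], fact_rows))  # False: estimate hits the fact row cap
```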


Best Practices For Aggregations


Aggregation design cycle
  Use Storage Design Wizard (~20% perf gain) to design an initial set of aggregations
  Enable the query log and run a pilot workload (beta test with a limited set of users)
  Use the Usage Based Optimization (UBO) Wizard to refine aggregations
  Periodically use UBO to refine aggregations


Scalable Shared Databases


Ask/Need
  An easy way of scaling out AS data across multiple machines.
Today's Problem
  Although MOLAP cubes are read-only databases, no two servers can share the same data directory.
  Cube Sync works but has latency issues that are not acceptable in load-balanced solutions.

AS 2008 Solution
  A single read-only copy of the database is shared between several Analysis Servers.


Diagram: a Virtual IP in front of several Analysis Servers, all attached to the same SAN storage.


Scalable Shared Databases In Practice


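Diagram: Clients connect through a load balancer (NLB, F5, or custom ASP.NET) to the Query Servers, which share a read-only database on a shared SAN drive. A separate Processing Server performs cube processing; the database is handed over by detach and attach.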

Enabling The SSD Scenario


DBStorageLocation
  Ability to store the database outside the server data folder
  SAN drive, NAS network share, flash/SSD

Attach/Detach
  Ability to attach/detach a database
  Attach from any location
  Attach as read-only or read-write
  Multiple instances can attach as read-only (shared)
  Only one instance can attach as read-write (exclusive)

Read Only Enforcement
  Disallow all update operations (processing, writeback, restore, etc.)
  Disable lazy processing, proactive caching
  Allow loading the database from read-only media
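
The attach rules and read-only enforcement amount to a shared/exclusive access model. The toy class below illustrates those rules and the process-then-share workflow; it is not an Analysis Services API, and the class, method, and instance names are all hypothetical.

```python
class SharedDatabase:
    """Toy model: many read-only attachments or exactly one read-write attachment."""

    def __init__(self):
        self.attached = {}   # instance name -> "ro" or "rw"

    def attach(self, instance, mode):
        if mode not in ("ro", "rw"):
            raise ValueError("mode must be 'ro' or 'rw'")
        if instance in self.attached:
            raise RuntimeError(f"{instance} is already attached")
        if mode == "rw" and self.attached:
            raise RuntimeError("read-write attach requires exclusive access")
        if "rw" in self.attached.values():
            raise RuntimeError("database is attached read-write elsewhere")
        self.attached[instance] = mode

    def detach(self, instance):
        del self.attached[instance]

    def process(self, instance):
        """Update operations are allowed only on a read-write attachment."""
        if self.attached.get(instance) != "rw":
            raise PermissionError("update operations are disallowed on a read-only database")
        return "processed"

# Workflow: the processing server updates the database, then query servers share it.
db = SharedDatabase()
db.attach("processing-server", "rw")
db.process("processing-server")
db.detach("processing-server")
db.attach("query-server-1", "ro")
db.attach("query-server-2", "ro")
print(sorted(db.attached))   # ['query-server-1', 'query-server-2']
```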
