Introduction to Netezza

Bank of America

Topics

Netezza Architecture
SQL Differences
Stored Procedure Differences

Know Your Customers. Grow Your Business.

Introduction to Netezza Ipsos Loyalty |Prepared for Bank of America

2011 Aginity LLC

Ipsos, January 2007

Netezza Architecture - TwinFin 12

Topics

System Architecture
Snippet Processing Unit (SPU)
System Capacity
Distribution in a shared-nothing architecture
Zone maps & data organization

Database Hardware Architecture Terminology

Symmetric Multiprocessing Architecture (SMP)

  Multiple processors sharing access to disk and memory
  Processors operate asynchronously
  Examples: Sybase, Sybase IQ, Oracle, SQL Server, DB2

Massively Parallel Processing Architecture (MPP)

  Multiple processors, each with separate, dedicated memory and disk
  No hardware is shared between processors
  Processors are slaved to a controller
  Processors operate synchronously
  Implemented in many different ways
    Netezza approach is unique
  Examples: Netezza, Teradata, Greenplum, Aster, Vertica

Netezza Architecture

SQL command sent from user to host
Host compiles SQL, develops an execution plan, and sends code snippets to SPUs based on the plan
Each SPU executes its code snippet. All SPUs execute the same code synchronously.
SPUs pass data between each other as needed
Host collects and returns result sets
Result sent from host to user
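
The host/SPU flow above can be sketched as a scatter-gather computation. This is a minimal illustration, not Netezza code: plain Python lists stand in for the SPUs' disks, and the function names (run_snippet, host_sum) are hypothetical.

```python
from typing import Callable, List


def run_snippet(spu_slices: List[List[float]],
                snippet: Callable[[List[float]], float]) -> List[float]:
    """Run the same snippet against every SPU's local rows."""
    return [snippet(rows) for rows in spu_slices]


def host_sum(spu_slices: List[List[float]]) -> float:
    """Host side: send one 'partial sum' snippet, then combine the partials."""
    partials = run_snippet(spu_slices, sum)
    return sum(partials)


# Three hypothetical SPUs, each seeing only its own rows.
slices = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
total = host_sum(slices)
print(total)
```

Each SPU works only on its own slice; the host sees the partial results, never the raw rows.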

Snippet Processing Unit (SPU)

Each SPU has a dedicated 1TB disk
Disk controlled by FPGA (Field Programmable Gate Array)
CPU loads query code into FPGA
FPGA executes code using memory and cache
CPU performs additional processing on the result set
CPU communicates with the controller and other SPUs via NIC

System Capacity - TwinFin 12

12 S-Blades, each with 8 SPUs (96 in total)
4 SPUs are spares for failover
92 usable SPUs
Each SPU has a 1TB disk drive
Disk divided into three 330GB partitions:

  Active data area
  Workspace
  Redundant copy of another SPU's active data area

Nominal 330GB capacity per SPU (uncompressed)
Nominal 30TB capacity per system (uncompressed)
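
The capacity figures above follow from simple arithmetic. The sketch below reproduces them from the numbers on the slide; the constant names are made up for illustration, and decimal units (1 TB = 1000 GB) are an assumption.

```python
S_BLADES = 12
SPUS_PER_BLADE = 8
SPARE_SPUS = 4
ACTIVE_PARTITION_GB = 330  # one of the three partitions on each 1TB disk

usable_spus = S_BLADES * SPUS_PER_BLADE - SPARE_SPUS
nominal_capacity_gb = usable_spus * ACTIVE_PARTITION_GB
nominal_capacity_tb = nominal_capacity_gb / 1000  # assumes decimal terabytes

print(usable_spus, nominal_capacity_gb, round(nominal_capacity_tb, 1))
```

Only the active data area counts toward nominal capacity; the workspace and the redundant copy take the other two partitions.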

MPP Shared Nothing Architecture

A SPU only has visibility to the data on its own disk.
A SPU will broadcast data to other SPUs as needed (based on instructions from the controller) via the NIC.
Broadcast data is received and held in memory or work area by the SPU for use in resolving the query.
Each SPU operates synchronously with all other SPUs. Each executes the same snippet at the same time.
The controller's execution plan ensures minimal data transfer between SPUs.

Data Distribution

Data distribution to a SPU is defined by the DISTRIBUTE ON clause of the CREATE TABLE command
Each table is distributed across all 92 SPUs
You can distribute on RANDOM or specify one or more columns in the table as a distribution key

Controller uses distribution key information when deciding the execution plan
Joins between tables with the same distribution key that are joined using that key will execute entirely on each SPU without the need to broadcast data (table co-location)
Joins to tables with RANDOM distribution will require data broadcast. The smaller of the two sets is broadcast

Effects of Poor Data Distribution

System Capacity

  The system is full when any one SPU is full
  A badly skewed distribution of a large table may fill up a SPU prematurely
  Try to keep skew under 10% (5% off average) for very large tables

System Performance

  SQL code snippets are executed synchronously among all SPUs
  The total execution time for a snippet is the longest time among the SPUs
  The total execution time for a query is the sum of the longest times for each snippet
  Snippet execution time is proportional to the amount of data a SPU needs to process
  A query will only run as fast as the slowest SPU
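
The timing rules above can be written down directly. The sketch below is illustrative only: the per-SPU timings are invented, and the skew formula, (max - min) / mean, is an assumed reading of "10% (5% off average)" rather than a definition taken from Netezza documentation.

```python
from typing import List


def snippet_time(spu_seconds: List[float]) -> float:
    """A snippet finishes when its slowest SPU finishes."""
    return max(spu_seconds)


def query_time(snippets: List[List[float]]) -> float:
    """A query's time is the sum of the slowest-SPU times of its snippets."""
    return sum(snippet_time(s) for s in snippets)


def skew(rows_per_spu: List[int]) -> float:
    """Assumed definition: spread between fullest and emptiest SPU over the mean."""
    mean = sum(rows_per_spu) / len(rows_per_spu)
    return (max(rows_per_spu) - min(rows_per_spu)) / mean


# Same total work (12 SPU-seconds) spread evenly and unevenly over 4 SPUs.
balanced = [[2.0, 2.0, 2.0, 2.0], [1.0, 1.0, 1.0, 1.0]]
skewed = [[1.0, 1.0, 1.0, 5.0], [0.5, 0.5, 0.5, 2.5]]

print(query_time(balanced), query_time(skewed))
print(skew([95, 100, 105, 100]))
```

With identical total work, the skewed layout takes 7.5 seconds against 3.0 for the balanced one, because every snippet waits for its most heavily loaded SPU.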

Data Distribution - Co-Location

Co-location is the physical placement of related data across multiple tables onto the same SPU

  Allows a SPU to join co-located data without interaction with other SPUs
  Speeds queries by eliminating data broadcast steps

Tables must have identical distribution key columns

  RANDOM distribution will not co-locate
  Any difference in columns will result in a different, unrelated SPU assignment

Join expression must include all distribution key columns

  Query compiler cannot assume co-location unless the join between two co-located tables includes equi-joins between all columns specified in the distribution key
  Ensure use by defining as few columns as possible as a distribution key

Don't bother with small tables

  No real advantage co-locating small tables
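
Why identical distribution keys co-locate rows can be seen in a small simulation. This is a sketch under stated assumptions: zlib.crc32 stands in for Netezza's internal hash function (which is not described in this deck), and spu_for is a hypothetical helper.

```python
import zlib

NUM_SPUS = 92


def spu_for(key: tuple) -> int:
    """Map the distribution key values of a row to an SPU number.

    crc32 is only a stand-in for the real (undisclosed) hash function.
    """
    return zlib.crc32(repr(key).encode("utf-8")) % NUM_SPUS


# TSNAMES and TSVALUES both distributed on (ID_TS): the row for ID_TS = 1001
# hashes to the same SPU in both tables, so the join needs no broadcast.
names_spu = spu_for((1001,))
values_spu = spu_for((1001,))

# A table distributed on (ID_TS, ID_TSDATE) hashes a different value, so its
# placement is unrelated even though ID_TS matches.
other_spu = spu_for((1001, 20110124))

print(names_spu, values_spu, other_spu)
```

The placement depends on the full set of key values, which is why a key with extra columns breaks co-location even when one column is shared.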

Data Distribution - Example

All 3 tables share ID_TS as a common key
ID_TS is always used when joining these tables
ID_TS is the appropriate column to use as a distribution key

  CREATE TABLE TSVALUES (
    ID_TSDATE  INTEGER NOT NULL,
    ID_TS      INTEGER NOT NULL,
    VALUE      DOUBLE PRECISION NOT NULL,
    ID_SRC     INTEGER NOT NULL,
    ID_STATUS  INTEGER NOT NULL,
    TS_BEGIN   TIMESTAMP NOT NULL,
    TS_END     TIMESTAMP NOT NULL,
    SIGMA      DOUBLE PRECISION NOT NULL
  )
  DISTRIBUTE ON (ID_TS)
  ORGANIZE ON (TS_END, TS_BEGIN);

Data Distribution - Example

Aginity Workbench

  Provides ability to redistribute existing tables
  Allows you to view a table's distribution

[Figure: distribution display for TSValues]

Co-Located Queries

  SELECT N.ID_TS, N.TICKER, N.NAME, V.VALUE
  FROM TSNAMES N, TSVALUES V
  WHERE N.ID_TS = V.ID_TS
    AND V.TS_BEGIN BETWEEN N.TS_BEGIN AND N.TS_END
    AND V.TS_END = '1/1/3000';

The query will execute in parallel across all SPUs without data sharing because:

  Both tables are distributed on the same key (ID_TS)
  ID_TS is used to join the tables

Identical distribution key definitions across tables ensure rows with the same distribution key values in those tables reside on the same disk and SPU.

Run times (493M row result set):

  Co-located: 22.1 seconds
  TSNames distributed on RANDOM: 74.3 seconds

Distribution Key Advice

Keep distribution key as small as possible

  One column is best
  Ensure full use in joins
  Tables with only some key columns in common will not be co-located

Use a key that provides even distribution

  Avoid skews > 10% in very large tables (> 5% nominal system capacity)
  Avoid skews > 5% in multiple same-key tables with total size > 5% nominal system capacity
  Avoid skews > 20% in other tables
  May be necessary to add a column to the distribution key to reduce skew
  Don't worry about skew in small tables (< 0.1% nominal system capacity)

Poor Distribution Keys

  Dates cause query hotspots when queries are based on date range
  Low cardinality columns
    Never use a column with cardinality less than the number of SPUs
    Columns with cardinality < 20 x number of SPUs may produce very bad skews
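
The warning about low cardinality keys can be checked with a small simulation. As before, zlib.crc32 is only a stand-in for the real hash function, and the row counts and key values are invented for illustration.

```python
from collections import Counter
import zlib

NUM_SPUS = 92


def rows_per_spu(keys):
    """Count how many rows land on each SPU for a stream of key values."""
    counts = Counter(zlib.crc32(str(k).encode("utf-8")) % NUM_SPUS for k in keys)
    return [counts.get(spu, 0) for spu in range(NUM_SPUS)]


# 100,000 rows keyed on a column with only 12 distinct values (e.g. a month).
low_cardinality = rows_per_spu(i % 12 for i in range(100_000))
# 100,000 rows keyed on a unique ID.
high_cardinality = rows_per_spu(range(100_000))

idle_spus = low_cardinality.count(0)
print("idle SPUs with 12 distinct keys:", idle_spus)
print("fullest SPU, low vs high cardinality:",
      max(low_cardinality), max(high_cardinality))
```

With 12 distinct key values, at most 12 of the 92 SPUs can hold any rows, so at least 80 sit idle while the loaded SPUs each carry thousands of rows.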

Zone Maps

A zone map records the high and low values of columns for rows in a data block on disk
Netezza uses zone maps to skip data blocks that do not satisfy query predicates, speeding query execution
Zone maps are automatically created for Integer, Date and Timestamp columns
You can specify additional zone map columns in the ORGANIZE ON clause of the CREATE TABLE statement
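
Block skipping with a zone map can be sketched in a few lines. This is a simplified model, not Netezza's on-disk format: each block is a Python list of integers, and build_zone_map and blocks_to_scan are hypothetical names.

```python
from typing import List, Tuple


def build_zone_map(blocks: List[List[int]]) -> List[Tuple[int, int]]:
    """Record the (low, high) value of the column for each data block."""
    return [(min(b), max(b)) for b in blocks]


def blocks_to_scan(zone_map: List[Tuple[int, int]], lo: int, hi: int) -> List[int]:
    """Return indexes of blocks whose range overlaps the predicate lo..hi."""
    return [i for i, (zmin, zmax) in enumerate(zone_map)
            if zmax >= lo and zmin <= hi]


blocks = [[1, 4, 7], [8, 9, 15], [16, 20, 31], [5, 30, 32]]
zone_map = build_zone_map(blocks)
hits = blocks_to_scan(zone_map, 9, 12)  # WHERE col BETWEEN 9 AND 12
print(zone_map)
print(hits)
```

Blocks 0 and 2 are skipped without being read. Block 3 is still scanned even though it holds no matching value, because its wide range (5 to 32) overlaps the predicate; narrow ranges per block are what make zone maps effective.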

ORGANIZE ON Clause

Defines a Clustered Base Table
Data is arranged (sorted) by the ORGANIZE ON columns during GROOM
Improves effectiveness of zone maps
Specify up to 4 columns
Allowable data types:

  INTEGER, DATE, TIMESTAMP
  CHAR, VARCHAR, NCHAR, NVARCHAR
    Only first 8 bytes used
  NUMERIC up to NUMERIC(18)
  FLOAT, DOUBLE
  BOOL
  TIME, TIME WITH TIME ZONE
  INTERVAL

Using ORGANIZE ON

Select columns most likely used in WHERE clause
Zone map efficiency diminishes for 3rd and 4th columns

  Greater chance of a wide range of values appearing in a data block

Arrange lower cardinality or clumpier columns first

  Groups larger numbers of rows first, spread over more data blocks
  Increases likelihood subsequent columns spread over multiple blocks
  Clumpy: a large portion of rows have few unique values

GROOM table periodically

  Reorganizes data according to ORGANIZE ON specification
  Removes deleted rows
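
The effect of organizing data on zone map efficiency can be illustrated by comparing the same values stored in arrival order and in sorted order. The block size, value range, and random data below are assumptions chosen for the sketch; sorting is a simplification of what GROOM does with ORGANIZE ON columns.

```python
import random
from typing import List

BLOCK_ROWS = 100  # hypothetical number of rows per data block


def blocks_scanned(values: List[int], target: int) -> int:
    """Count blocks whose [min, max] range could contain the target value."""
    blocks = [values[i:i + BLOCK_ROWS] for i in range(0, len(values), BLOCK_ROWS)]
    return sum(1 for b in blocks if min(b) <= target <= max(b))


rng = random.Random(0)
values = [rng.randrange(1000) for _ in range(10_000)]  # rows in arrival order

unorganized = blocks_scanned(values, 500)
organized = blocks_scanned(sorted(values), 500)  # rows sorted on the column

print("blocks scanned, unorganized:", unorganized)
print("blocks scanned, organized:", organized)
```

In arrival order nearly every block spans the whole value range and must be read; after sorting, the matching rows sit in one or two contiguous blocks.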

Using ORGANIZE ON - TSValues

All current rows have ts_end = '1/1/3000' (very clumpy)
Current rows are of greatest interest (most often selected)
ts_begin commonly used in most queries
Specify ts_end first. Clumps all current rows into contiguous blocks.

  CREATE TABLE TSVALUES (
    ID_TSDATE  INTEGER NOT NULL,
    ID_TS      INTEGER NOT NULL,
    VALUE      DOUBLE PRECISION NOT NULL,
    ID_SRC     INTEGER NOT NULL,
    ID_STATUS  INTEGER NOT NULL,
    TS_BEGIN   TIMESTAMP NOT NULL,
    TS_END     TIMESTAMP NOT NULL,
    SIGMA      DOUBLE PRECISION NOT NULL
  )
  DISTRIBUTE ON (ID_TS)
  ORGANIZE ON (TS_END, TS_BEGIN);

GROOM Command

GROOM TABLE TSVALUES RECORDS ALL RECLAIM BACKUPSET DEFAULT;

Organizes rows based on ORGANIZE ON clause
Reclaims space from deleted rows

  Netezza performs updates by deleting old row and inserting new row
  Deleted rows consume space until table is GROOMed

Fast: 15 minutes for 4.6 billion rows in TSValues
Table remains available for query/update

SQL Differences - Netezza/Sybase

Topics

General Language Characteristics
Identifiers
Data Types
Functions
Command differences

General Language Characteristics

Similar to Oracle syntax and interpretation
Commands terminated with semi-colon (;)

  Terminator is required at all times

Uses Oracle's interpretation of NULL

  A zero length string ('') is considered NULL
  Assigning '' to a character column defined as NOT NULL will generate an error

Identifiers

An identifier names a database object
There are 2 types of identifiers: regular and delimited

Regular Identifier

  Is case insensitive
  Must begin with a letter
  May contain letters, digits, underscores, dollar sign ($)

Delimited Identifier

  Enclosed in double quotes
  Is case sensitive
  May include spaces, other special symbols and reserved words
  May begin with any allowable character
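
The rules for regular identifiers translate into a simple pattern check. The sketch below is an approximation: it assumes "letter" means an ASCII letter and does not check reserved words, both of which are simplifications not stated on the slide. The function names are hypothetical.

```python
import re

# Begins with a letter; then letters, digits, underscores or dollar signs.
REGULAR_IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_$]*")


def is_regular_identifier(name: str) -> bool:
    """True if the name can be written without double quotes (reserved words aside)."""
    return REGULAR_IDENTIFIER.fullmatch(name) is not None


def needs_delimiting(name: str) -> bool:
    """True if the name must be enclosed in double quotes."""
    return not is_regular_identifier(name)


for candidate in ["ID_TS", "cost$", "1abc", "my table"]:
    print(candidate, needs_delimiting(candidate))
```

Names that fail the check, such as those containing spaces or starting with a digit, have to be written as delimited identifiers and then become case sensitive.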

Data Types

Does not support LOB types (CLOB, BLOB)

Character Types

  CHAR, VARCHAR: ASCII data, 64,000 maximum length
  NCHAR, NVARCHAR: Unicode data, 16,000 maximum length

Exact Numeric Types

  BYTEINT: 8 bit signed integer
  SMALLINT: 16 bit signed integer
  INTEGER: 32 bit signed integer
  BIGINT: 64 bit signed integer
  NUMERIC(p,s), NUMERIC(p): up to 38 digits precision (p), scale (s) from 0 to p
  NUMERIC: same as NUMERIC(18, 0)

Data Types (continued)

Approximate Numeric Types

  FLOAT(p): precision can range from 1 to 15
  REAL: same as FLOAT(6), 4 bytes
  DOUBLE PRECISION: same as FLOAT(15), 8 bytes

Logical Types

  BOOLEAN (or BOOL): true/false value, 1 byte

Temporal Types

  Time maintained to microsecond (6 decimal places)
  DATE: date with no time, 4 bytes
  TIME: time with no date, 8 bytes
  TIME WITH TIME ZONE: time with time zone information, 12 bytes
  TIMESTAMP: date and time, 8 bytes
  INTERVAL: time interval, non-standard implementation
    Maintained in seconds
    Ignores unit declarations; literals require explicit units
    Month assumed to be 30 days

Internal Data Types (pseudo-columns)

ROWID

  Unique row identifier, assigned when row is inserted
  Not sequential in table
  Value range 100,000 - 9,223,372,036,854,775,807
  Will numbers ever repeat?
    If a table contains 2 billion rows and takes 3 minutes to copy
    And if the table is copied repeatedly and continuously (add 2 billion rows every 3 minutes)
    It will take over 24,000 years to run out of numbers
    Netezza hopes you would have upgraded before then

CREATEXID, DELETEXID

  Transaction IDs that created and deleted the row
  If DELETEXID > 0, row has been deleted
  You cannot see deleted rows using SQL

DATASLICEID

  Identifies the SPU holding the row
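
The ROWID exhaustion estimate can be checked with integer arithmetic. The sketch uses only the figures quoted above; the 365.25-day year is an assumption.

```python
ROWID_MIN = 100_000
ROWID_MAX = 9_223_372_036_854_775_807  # largest 64 bit signed integer
ROWS_PER_COPY = 2_000_000_000
MINUTES_PER_COPY = 3

available_ids = ROWID_MAX - ROWID_MIN + 1
copies = available_ids // ROWS_PER_COPY
minutes = copies * MINUTES_PER_COPY
years = minutes / (60 * 24 * 365.25)  # assumes an average year of 365.25 days

print(f"{copies:,} copies, about {years:,.0f} years")
```

The result comes out somewhat above 26,000 years, consistent with the slide's "over 24,000 years".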

Operators

Differences from Sybase

  Operation        Netezza operator
  Concatenation    ||
  Not equal        <> or !=
  Null test        ISNULL or IS NULL
                   NOTNULL or IS NOT NULL

Conditional Column Expression Functions

CASE

  Searched form:

    CASE
      WHEN <search-condition-1> THEN <result-1>
      WHEN <search-condition-2> THEN <result-2>
      ...
      WHEN <search-condition-n> THEN <result-n>
      ELSE <default-result>
    END

  Search conditions can be arbitrarily complex and results can be expressions.

  Value form:

    CASE <test-value>
      WHEN <comparand-value-1> THEN <result-1>
      WHEN <comparand-value-2> THEN <result-2>
      ...
      WHEN <comparand-value-n> THEN <result-n>
      ELSE <default-result>
    END

  Test values, comparand values, and results can be expressions.

DECODE

  Same as Oracle DECODE

    decode(<expr>, <search1>, <result1>, ..., <search N>, <result N>, <default>)
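
The argument layout of DECODE is easy to misread, so here is a minimal model of its semantics. It is a sketch of the Oracle-style behaviour the slide refers to, written as a hypothetical Python function, not the database implementation; Python's None stands in for NULL, which Oracle's DECODE treats as matching another NULL.

```python
from typing import Any


def decode(expr: Any, *args: Any) -> Any:
    """Return the result paired with the first search value equal to expr.

    args is search1, result1, ..., searchN, resultN, optionally followed by
    a default. Without a default, no match yields None (NULL).
    """
    pairs, default = args, None
    if len(args) % 2 == 1:  # odd count: the last argument is the default
        pairs, default = args[:-1], args[-1]
    for search, result in zip(pairs[0::2], pairs[1::2]):
        if expr == search:
            return result
    return default


print(decode(2, 1, "one", 2, "two", "other"))
print(decode(9, 1, "one", 2, "two", "other"))
print(decode(9, 1, "one"))
```

The same mapping written as a value-form CASE would list one WHEN per search/result pair and put the default in ELSE.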

Commonly Used Functions

Current time

  Use NOW() or CURRENT_TIMESTAMP
  TIMEOFDAY() returns a verbose date string. Example: Mon Jan 24 16:12:05 2011 EST

Date Conversion

  Date literal is in MM/DD/YYYY format (default). Example: '1/1/3000'
  TO_DATE(<text>, <template>)
    Converts string to DATE data type
    Template describes format of date string. Example: TO_DATE('24 Jan 2011', 'DD Mon YYYY')
  TO_TIMESTAMP(<text>, <template>)
    Converts string to TIMESTAMP data type
  TO_CHAR(<date or timestamp>, <template>)
    Converts date or timestamp to string
  See User Guide pp. 3-26 to 3-28 for template information

SQL Differences - Update

Sybase

  update #tempRaw
  Set a.Sigma = b.Sigma
  from #tempRaw a, TSValues b
  where a.ID_TS = b.ID_TS
    and a.ID_TSDate = b.ID_TSDate
    and getdate() < b.ts_end
    and a.ID_Status >= 1
    and a.ID_StatusSigma <= -1
    and IsDelete = 0

Netezza

  update tempRaw a
  Set Sigma = b.Sigma
  from TSValues b
  where a.ID_TS = b.ID_TS
    and a.ID_TSDate = b.ID_TSDate
    and now() < b.ts_end
    and a.ID_Status >= 1
    and a.ID_StatusSigma <= -1
    and a.isdelete = 0;

Differences in the Netezza form:

  Special characters (#) not used for temporary tables
  Updated table only referenced once

SQL Differences - DELETE

DELETE may only reference one table

  DELETE FROM <table> WHERE <condition>

Cannot directly convert:

  DELETE #tempRaw
  FROM TSValues B, #tempRaw A
  where A.ID_TS = B.ID_TS
    and A.ID_TSDate = B.ID_TSDate
    and ( ( A.Value = B.Value and A.ID_Status = B.ID_Status )
       or ( B.ID_Status >= 1 and A.ID_Status <= -1 ) )
    AND ( abs(A.Sigma - B.Sigma) < 0.00000001 or A.ID_StatusSigma <= -1 )
    and getdate() between B.ts_begin and B.ts_end
    AND A.Comments IS NULL

Use UPDATE then DELETE:

  Add a boolean column deleteme to tempRaw table. Initialize to FALSE.

  UPDATE tempRaw A
  set deleteme = TRUE
  FROM TSValues B
  WHERE A.ID_TS = B.ID_TS
    AND A.ID_TSDate = B.ID_TSDate
    AND ( ( A.Value = B.Value and A.ID_Status = B.ID_Status )
       OR ( B.ID_Status >= 1 and A.ID_Status <= -1 ) )
    AND ( abs(A.Sigma - B.Sigma) < 0.00000001 OR A.ID_StatusSigma <= -1 )
    AND NOW() BETWEEN B.ts_begin AND B.ts_end
    AND A.Comments IS NULL;

  DELETE FROM tempRaw WHERE deleteme;

SQL Differences

Materialized Views

  View may have only one source table
  Limited to 64 columns
  May include an ORDER BY clause
    Optimizes zone maps
  Cannot include an ORGANIZE ON clause

Sequences

  You may specify data type (byteint, smallint, int, bigint)
  Default is bigint (64 bit signed integer)
  Blocks of numbers are allocated to each SPU during execution
  Sequential number assignment cannot be assured
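
Why block allocation prevents strictly sequential numbering can be shown with a toy model. The block size of 1000 and the BlockSequence class are hypothetical; the slide does not state how large the blocks are or how they are handed out.

```python
from itertools import count
from typing import Dict, Iterator

BLOCK_SIZE = 1000  # hypothetical; the real block size is not given in this deck


class BlockSequence:
    """Toy sequence that hands each SPU its own block of numbers."""

    def __init__(self) -> None:
        self._next_block_start = count(start=1, step=BLOCK_SIZE)
        self._blocks: Dict[int, Iterator[int]] = {}

    def next_value(self, spu: int) -> int:
        # Take the next number from this SPU's block, if it has one left.
        value = next(self._blocks.get(spu, iter(())), None)
        if value is None:
            # No block yet, or block exhausted: allocate a fresh block.
            start = next(self._next_block_start)
            self._blocks[spu] = iter(range(start, start + BLOCK_SIZE))
            value = next(self._blocks[spu])
        return value


seq = BlockSequence()
values = [seq.next_value(spu) for spu in (0, 1, 0, 2, 1)]
print(values)
```

The values are unique but jump between blocks depending on which SPU processes each row, so gaps and out-of-order numbers are expected.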

Stored Procedure Differences

Topics

General Differences
CREATE PROCEDURE statement
Iterating through rows (cursors)
Returning result sets
System Variables & Special Functions

General Differences

Statements are terminated with semi-colons (;)
Variables are not prefixed with special characters
:= symbol used for variable assignment

  a := 1;

Syntax similar to Oracle PL/SQL

CREATE PROCEDURE

Sybase

  CREATE PROCEDURE sp_TS_ProcessFillValues (@LoaderID int, @ID_SRC int)
  AS
  DECLARE @err_no int, @err_msg varchar(250), @CurrentTime datetime
  <... procedure code ...>
  GO

Netezza

  CREATE PROCEDURE sp_TS_ProcessFillValues (INTEGER, INTEGER)
  RETURNS INTEGER
  EXECUTE AS CALLER
  LANGUAGE NZPLSQL AS
  begin_proc
  DECLARE
    LoaderID ALIAS FOR $1;
    ID_SRC ALIAS FOR $2;
    CurrentTime datetime;
    EndDate datetime;
    StartDate datetime;
  begin
    <... procedure code ...>
  end;
  end_proc;

Notes on the Netezza form:

  Parameters are unnamed, referenced by position: $1, $2
  Alias may be used to name parameters
  Procedure enclosed in a BEGIN_PROC END_PROC block
  Executable code enclosed in a BEGIN END block following declarations

Iterating Through Rows

Netezza does not support cursor data type
Use looping to iterate through a result set:

  FOR record | row IN select_clause LOOP
    statements
  END LOOP;

Alternate form:

  FOR record | row IN EXECUTE text_expression LOOP
    statements
  END LOOP;

Example

  DECLARE
    myrec RECORD;
    mytotal DOUBLE PRECISION;
  BEGIN
    mytotal := 0;  -- initialize first: adding to an unassigned (NULL) variable yields NULL
    FOR myrec IN SELECT * FROM tsvalues LOOP
      mytotal := mytotal + myrec.value;
    END LOOP;
  END;

Returning Sets from Stored Procedures

Use RETURNS REFTABLE (<table name>) clause

  <table name> refers to a table defined in the schema
  Used to define the format of the result set being returned (metadata only)

Use REFTABLENAME to refer to the result set

  REFTABLENAME is a system provided variable
  System generates a unique table name
  Essentially a CREATE TEMP TABLE command
  Generate result set using EXECUTE IMMEDIATE
  Example: EXECUTE IMMEDIATE 'INSERT INTO ' || REFTABLENAME || ' VALUES ()';

Use RETURN REFTABLE to return result set

System Variables and Special Functions

SQLERRM

  Error message relating to last SQL error

ROW_COUNT

  Number of rows affected by last executed SQL statement

QUOTE_LITERAL(<expression>)

  Used when building EXECUTE IMMEDIATE strings
  Converts the value of the expression into a properly formatted literal string
  Escapes any special characters as necessary

QUOTE_IDENT(<expression>)

  Used when building EXECUTE IMMEDIATE strings
  Converts expression into properly formatted database object name string
  Mainly used for delimited identifiers
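
What the two quoting functions protect against can be shown with simplified stand-ins. These Python functions are approximations written for illustration: they only double embedded quote characters and always add the surrounding quotes, whereas the real QUOTE_LITERAL and QUOTE_IDENT may handle further cases.

```python
def quote_literal(value: str) -> str:
    """Simplified stand-in: wrap in single quotes, doubling embedded ones."""
    return "'" + value.replace("'", "''") + "'"


def quote_ident(name: str) -> str:
    """Simplified stand-in: wrap in double quotes, doubling embedded ones."""
    return '"' + name.replace('"', '""') + '"'


# Building a dynamic statement the way an EXECUTE IMMEDIATE string is built.
sql = ("INSERT INTO " + quote_ident("My Table")
       + " VALUES (" + quote_literal("O'Brien") + ")")
print(sql)
```

Without the quoting step, the apostrophe in the value would end the string literal early and the space in the table name would split the identifier.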

Netezza Manuals

Full documentation can be found at:

\\crpnycnaf00n2\ReconDQ\Netezza\Manuals
