
Really Big Elephants

Data Warehousing

Josh Berkus MySQL User Conference 2011

I will cover:

  advantages of Postgres for DW
  configuration
  tablespaces
  ETL/ELT
  windowing
  partitioning
  materialized views

I won't cover:

  hardware selection
  EAV / blobs
  denormalization
  DW query tuning
  external DW tools
  backups & upgrades

What is a data warehouse?

synonyms etc.

Business Intelligence

also BI/DW

Analytics database
OnLine Analytical Processing (OLAP)
Data Mining
Decision Support



Data warehouse                        OLTP
few large batch imports               many single-row writes
years of data                         current data
queries generated by large reports    queries generated by user activity
queries can run for hours             < 1s response times
database 5x to 2000x RAM              database 0.5 to 5x RAM
1 to 10 users                         100 to 1000 users
no constraints                        constraints

Why use PostgreSQL for data warehousing?

Complex Queries
SELECT
  CASE WHEN ((SUM(inventory.closed_on_hand) + SUM(changes.received) + SUM(changes.adjustments)
              + SUM(changes.transferred_in-changes.transferred_out)) <> 0)
    THEN ROUND((CAST(SUM(changes.sold_and_closed + changes.returned_and_closed) AS numeric) * 100)
               / CAST(SUM(starting.closed_on_hand) + SUM(changes.received) + SUM(changes.adjustments)
                      + SUM(changes.transferred_in-changes.transferred_out) AS numeric), 5)
    ELSE 0 END AS "Percent_Sold",
  CASE WHEN (SUM(changes.sold_and_closed) <> 0)
    THEN ROUND(100*((SUM(changes.closed_markdown_units_sold)*1.0) / SUM(changes.sold_and_closed)), 5)
    ELSE 0 END AS "Percent_of_Units_Sold_with_Markdown",
  CASE WHEN (SUM(changes.sold_and_closed * _sku.retail_price) <> 0)
    THEN ROUND(100*(SUM(changes.closed_markdown_dollars_sold)*1.0)
               / SUM(changes.sold_and_closed * _sku.retail_price), 5)
    ELSE 0 END AS "Markdown_Percent",
  '0' AS "Percent_of_Total_Sales",
  CASE WHEN SUM((changes.sold_and_closed + changes.returned_and_closed) * _sku.retail_price) IS NULL
    THEN 0
    ELSE SUM((changes.sold_and_closed + changes.returned_and_closed) * _sku.retail_price)
    END AS "Net_Sales_at_Retail",
  '0' AS "Percent_of_Ending_Inventory_at_Retail",
  SUM(inventory.closed_on_hand * _sku.retail_price) AS "Ending_Inventory_at_Retail",
  "_store"."label" AS "Store",
  "_department"."label" AS "Department",
  "_vendor"."name" AS "Vendor_Name"
FROM inventory
  JOIN inventory as starting
    ON inventory.warehouse_id = starting.warehouse_id
    AND inventory.sku_id = starting.sku_id
  LEFT OUTER JOIN (
    SELECT warehouse_id, sku_id,
      sum(received) as received,
      sum(transferred_in) as transferred_in,
      sum(transferred_out) as transferred_out,
      sum(adjustments) as adjustments,
      sum(sold) as sold
    FROM movement
    WHERE movement.movement_date BETWEEN '2010-08-05' AND '2010-08-19'
    GROUP BY sku_id, warehouse_id
  ) as changes
    ON inventory.warehouse_id = changes.warehouse_id
    AND inventory.sku_id = changes.sku_id
  JOIN _sku ON = inventory.sku_id
  JOIN _warehouse ON = inventory.warehouse_id
  JOIN _location_hierarchy AS _store ON = _warehouse.store_id AND _store.type = 'Store'
  JOIN _product ON = _sku.product_id
  JOIN _merchandise_hierarchy AS _department

Complex Queries

JOIN optimization
  5 different JOIN types
  approximate planning for 20+ table joins

subqueries in any clause
  plus nested subqueries

windowing queries
recursive queries

Big Data Features

big tables       partitioning
big databases    tablespaces
big backups      PITR
big updates      binary replication
big queries      resource control


add data analysis functionality from external libraries inside the database
  financial analysis
  genetic sequencing
  approximate queries

create your own:
  data types
  aggregates
  functions
  operators

"I'm running a partitioning scheme using 256 tables with a maximum of 16 million rows (namely IPv4-addresses) and a current total of about 2.5 billion rows, there are no deletes though, but lots of updates."

"I use PostgreSQL basically as a data warehouse to store all the genetic data that our lab generates"

"With this configuration I figure I'll have ~3TB for my main data tables and 1TB for indexes."

lots of experience with large databases
blogs, tools, online help

Sweet Spot

[Figure: two charts on a 0 to 30 scale; the second is labeled "DW Database"]

DW Databases

Vertica Greenplum Aster Data Infobright Teradata Hadoop/HBase

Netezza HadoopDB LucidDB MonetDB SciDB Paraccel


How do I configure PostgreSQL for data warehousing?

General Setup

Latest version of PostgreSQL
System with lots of drives
  6 to 48 drives
  or 2 to 12 SSDs
High-throughput RAID
Write ahead log (WAL) on separate disk(s)
  10 to 50 GB space

separate the DW workload onto its own server

few connections
max_connections = 10 to 40

raise those memory limits!

shared_buffers = 1/8 to of RAM
work_mem = 128MB to 1GB
maintenance_work_mem = 512MB to 1GB
temp_buffers = 128MB to 1GB
effective_cache_size = of RAM
wal_buffers = 16MB

No autovacuum
autovacuum = off
vacuum_cost_delay = 0

do your VACUUMs and ANALYZEs as part of the batch load process

usually several of them

also maintain tables by partitioning
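With autovacuum off, the load job has to run the maintenance itself. A minimal sketch of what that could look like follows; the table name and file path are borrowed from the COPY example in this talk, while the `page` column and the cleanup step are assumptions made purely for illustration.

```sql
-- Sketch only: maintenance issued by the batch load job, since autovacuum is off.
-- The cleanup DELETE and the page column are hypothetical.
COPY weblog_new FROM '/mnt/transfers/weblogs/weblog20110605.csv' WITH CSV;

-- Refresh planner statistics right after the load.
ANALYZE weblog_new;

-- A cleanup pass that deletes or updates rows leaves dead rows behind ...
DELETE FROM weblog_new WHERE page IS NULL;

-- ... so vacuum and re-analyze before the reports run.
-- VACUUM cannot run inside a transaction block.
VACUUM ANALYZE weblog_new;
```

This is why "usually several of them" applies: each load step that changes many rows is a candidate for its own ANALYZE or VACUUM.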

What are tablespaces?

logical data extents

lets you put some of your data on specific devices / disks

CREATE TABLESPACE history_log LOCATION '/mnt/san2/history_log';
ALTER TABLE history_log SET TABLESPACE history_log;

tablespace reasons

parallelize access

  your largest fact table on one tablespace
  its indexes on another
  not as useful if you have a good SAN

temp tablespace for temp tables
move key join tables to SSD
migrate to new storage one table at a time
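The reasons above might translate into statements like the following sketch. The tablespace names, mount points, the `log_time` column and the `store_lookup` table are all hypothetical; only `history_log` comes from the earlier example. Each directory must already exist, be empty, and be owned by the PostgreSQL operating-system user.

```sql
-- Hypothetical tablespaces on separate devices.
CREATE TABLESPACE history_idx LOCATION '/mnt/san3/history_idx';
CREATE TABLESPACE fast_ssd    LOCATION '/mnt/ssd1/pgdata';

-- Parallelize access: the fact table stays where it is,
-- its index is built on another device.
CREATE INDEX history_log_time_idx
    ON history_log (log_time)          -- log_time is an assumed column
    TABLESPACE history_idx;

-- Temp tablespace for temp tables and sort spill files (this session only).
SET temp_tablespaces = 'fast_ssd';

-- Move a key join table to SSD. This rewrites the table and holds an
-- exclusive lock while it runs, so do it one table at a time.
ALTER TABLE store_lookup SET TABLESPACE fast_ssd;   -- hypothetical table
```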

What is ETL and how do I do it?

Extract, Transform, Load

how you turn external raw data into normalized database data

Apache logs -> web analytics DB
CSV POS files -> financial reporting DB
OLTP server -> 10-year data warehouse

also called ELT when the transformation is done inside the database

PostgreSQL is particularly good for ELT


batch INSERTs into hundreds or thousands per transaction
  row-at-a-time is very slow
create and load import tables in one transaction
add indexes and constraints after load
insert several streams in parallel
  but not more than CPU cores
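A minimal sketch of those loading rules, assuming a hypothetical import table `sales_import` and made-up rows: the table is created and filled in one transaction, rows go in as a batch, and the index and constraint are added only after the data is in place.

```sql
BEGIN;

-- Create and load the import table in the same transaction.
CREATE TABLE sales_import (
    sell_date   TIMESTAMPTZ,
    seller_id   INT,
    item_id     INT,
    sale_amount NUMERIC
);

-- One multi-row INSERT instead of one statement and one commit per row.
-- A real batch would carry hundreds or thousands of rows.
INSERT INTO sales_import (sell_date, seller_id, item_id, sale_amount) VALUES
    ('2011-06-01 09:15:00', 17, 1001, 19.99),
    ('2011-06-01 09:16:30', 17, 1002,  4.50),
    ('2011-06-01 09:20:05', 23, 1001, 19.99);

COMMIT;

-- Indexes and constraints after the load, so they are built once
-- rather than maintained row by row.
CREATE INDEX sales_import_sell_date_idx ON sales_import (sell_date);
ALTER TABLE sales_import ALTER COLUMN sell_date SET NOT NULL;
```

To load several streams in parallel, each stream would run its own transaction like this on its own connection, against its own import table or file slice.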


COPY: powerful, efficient delimited file loader
  almost bug-free - we use it for backup
  3-5X faster than inserts
  works with most delimited files

Not fault-tolerant
  also have to know structure in advance
  try pgloader for better COPY

COPY weblog_new FROM '/mnt/transfers/weblogs/weblog20110605.csv' with csv;
COPY traffic_snapshot FROM 'traffic_20110605192241' delimiter '|' null as 'N';
\copy weblog_summary_june TO 'Desktop/weblog-june2011.csv' with csv header;

L: in 9.1: FDW
CREATE FOREIGN TABLE raw_hits (
    hit_time TIMESTAMP,
    page TEXT
)
SERVER file_fdw
OPTIONS (format 'csv', delimiter ';', filename '/var/log/hits.log');

L: in 9.1: FDW
CREATE TABLE hits_2011041617 AS
SELECT page, count(*)
FROM raw_hits
WHERE hit_time > '2011-04-16 16:00:00'
  AND hit_time <= '2011-04-16 17:00:00'
GROUP BY page;
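The foreign table above refers to a server named `file_fdw`, which has to exist first. A sketch of the one-time setup, assuming the contrib module is installed and the statements are run by a superuser:

```sql
-- file_fdw ships in contrib; CREATE EXTENSION is available from 9.1.
CREATE EXTENSION file_fdw;

-- The server object is only a named handle for the wrapper. Calling it
-- file_fdw matches the SERVER clause in the CREATE FOREIGN TABLE example.
CREATE SERVER file_fdw FOREIGN DATA WRAPPER file_fdw;

-- Once raw_hits is defined, the log file can be queried in place;
-- the file is re-read on every scan.
SELECT page, count(*) AS hits
FROM raw_hits
GROUP BY page
ORDER BY hits DESC
LIMIT 10;
```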

T: temporary tables
CREATE TEMPORARY TABLE sales_records_june_rollup
ON COMMIT DROP AS
SELECT seller_id, location, sell_date,
    sum(sale_amount), array_agg(item_id)
FROM raw_sales
WHERE sell_date BETWEEN '2011-06-01' AND '2011-06-30 23:59:59.999'
GROUP BY seller_id, location, sell_date;

in 9.1: unlogged tables

like MyISAM without the risk

CREATE UNLOGGED TABLE cleaned_log_import AS
SELECT hit_time, page
FROM raw_hits, hit_watermark
WHERE hit_time > last_watermark
  AND is_valid(page);

T: stored procedures

multiple languages

SQL, PL/pgSQL, PL/Perl, PL/Python, PL/PHP, PL/R, PL/Java
  allows you to use external data processing libraries in the database

custom aggregates, operators, more

CREATE OR REPLACE FUNCTION normalize_query ( queryin text )
RETURNS TEXT LANGUAGE PLPERL STABLE STRICT AS $f$
# this function "normalizes" queries by stripping out constants.
# some regexes by Guillaume Smet under The PostgreSQL License.
local $_ = $_[0];
#first cleanup the whitespace
s/\s+/ /g;
s/\s,/,/g;
s/,(\S)/, $1/g;
s/^\s//g;
s/\s$//g;
#remove any double quotes and quoted text
s/\\'//g;
s/'[^']*'/''/g;
s/''('')+/''/g;
#remove TRUE and FALSE
s/(\W)TRUE(\W)/$1BOOL$2/gi;
s/(\W)FALSE(\W)/$1BOOL$2/gi;
#remove any bare numbers or hex numbers
s/([^a-zA-Z_\$-])-?([0-9]+)/${1}0/g;
s/([^a-z_\$-])0x[0-9a-f]{1,10}/${1}0x/ig;
#normalize any IN statements
s/(IN\s*)\([\'0x,\s]*\)/${1}(...)/ig;
#return the normalized query
return $_;
$f$;

CREATE OR REPLACE FUNCTION f_graph2() RETURNS text AS '
sql <- paste("SELECT id as x,hit as y FROM mytemp LIMIT 30",sep="");
str <- c(pg.spi.exec(sql));
mymain <- "Graph 2";
mysub <- paste("The worst offender is: ",str[1,3]," with ",str[1,2]," hits",sep="");
myxlab <- "Top 30 IP Addresses";
myylab <- "Number of Hits";
pdf(''/tmp/graph2.pdf'');
plot(str,type="b",main=mymain,sub=mysub,xlab=myxlab,ylab=myylab,lwd=3);
mtext("Probes by intrusive IP Addresses",side=3);
# close the PDF device so the file is written out
dev.off();
print(''DONE'');
' LANGUAGE plr;

ELT Tips

bulk insert into a new table instead of updating/deleting an existing table
update all columns in one operation instead of one at a time
use views and custom functions to simplify your queries
inserting into your long-term tables should be the very last step
  no updates after!

What's a windowing query?

regular aggregate

windowing function

CREATE TABLE events (
    event_id INT,
    event_type TEXT,
    start TIMESTAMPTZ,
    duration INTERVAL,
    event_desc TEXT
);

SELECT MAX(concurrent)
FROM (
    SELECT SUM(tally) OVER (ORDER BY start) AS concurrent
    FROM (
        SELECT start, 1::INT as tally FROM events
        UNION ALL
        SELECT (start + duration), -1 FROM events
    ) AS event_vert
) AS ec;
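The difference between a regular aggregate and a windowing function can be seen side by side on the events table. This is a sketch for illustration only; the output column names are made up.

```sql
-- Regular aggregate: rows are collapsed, one output row per event_type.
SELECT event_type, count(*) AS events_of_type
FROM events
GROUP BY event_type;

-- Windowing function: every event row is kept, and each row also carries
-- a value computed over its "window" of related rows.
SELECT event_id,
       event_type,
       start,
       count(*)      OVER (PARTITION BY event_type)                AS events_of_type,
       sum(duration) OVER (PARTITION BY event_type ORDER BY start) AS running_duration
FROM events;
```

The concurrency query above uses the same idea: `SUM(tally) OVER (ORDER BY start)` is a running total of +1 for each event start and -1 for each event end, and the outer `MAX` picks the peak.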

UPDATE partition_name SET drop_month = dropit
FROM (
    SELECT round_id,
        CASE WHEN ( ( row_number() over (partition by team_id order by team_id, total_points) ) <= ( drop_lowest ) )
            THEN 0 ELSE 1 END as dropit
    FROM (
        SELECT team.team_id, round.round_id, month_points as total_points,
            row_number() OVER (
                partition by team.team_id, kal.positions
                order by team.team_id, kal.positions, month_points desc
            ) as ordinal,
            at_least, numdrop as drop_lowest
        FROM partition_name as rdrop
            JOIN round USING (round_id)
            JOIN team USING (team_id)
            JOIN pick ON round.round_id = pick.round_id
                and pick.pick_period @> this_period
            LEFT OUTER JOIN keep_at_least kal ON rdrop.pool_id = kal.pool_id
                and pick.position_id = any ( kal.positions )
        WHERE rdrop.pool_id = this_pool
            AND team.team_id = this_team
    ) as ranking
    WHERE ordinal > at_least or at_least is null
) as droplow
WHERE droplow.round_id = partition_name.round_id
    AND partition_name.pool_id = this_pool
    AND dropit = 0;

SELECT round_id,
    CASE WHEN ( ( row_number() OVER (partition by team_id order by team_id, total_points) ) <= ( drop_lowest ) )
        THEN 0 ELSE 1 END as dropit
FROM (
    SELECT team.team_id, round.round_id, month_points as total_points,
        row_number() OVER (
            partition by team.team_id, kal.positions
            order by team.team_id, kal.positions, month_points desc
        ) as ordinal

stream processing SQL

replace multiple queries with a single query
  avoid scanning large tables multiple times

replace pages of application code
  and MB of data transmission

SQL alternative to map/reduce
  (for some data mining tasks)

How do I partition my tables?

Postgres partitioning

based on table inheritance and constraint exclusion

partitions are also full tables
explicit constraints define the range of the partition
triggers or RULEs handle insert/update

CREATE TABLE sales (
    sell_date TIMESTAMPTZ NOT NULL,
    seller_id INT NOT NULL,
    item_id INT NOT NULL,
    sale_amount NUMERIC NOT NULL,
    narrative TEXT
);

CREATE TABLE sales_2011_06 (
    CONSTRAINT partition_date_range
    CHECK (sell_date >= '2011-06-01'
        AND sell_date < '2011-07-01' )
) INHERITS ( sales );

CREATE FUNCTION sales_insert ()
RETURNS trigger
LANGUAGE plpgsql AS $f$
BEGIN
    -- route each new row to its partition by the partition key
    CASE
        WHEN NEW.sell_date < '2011-06-01' THEN
            INSERT INTO sales_2011_05 VALUES (NEW.*);
        WHEN NEW.sell_date < '2011-07-01' THEN
            INSERT INTO sales_2011_06 VALUES (NEW.*);
        WHEN NEW.sell_date >= '2011-07-01' THEN
            INSERT INTO sales_2011_07 VALUES (NEW.*);
        ELSE
            INSERT INTO sales_overflow VALUES (NEW.*);
    END CASE;
    -- returning NULL keeps the row out of the parent table
    RETURN NULL;
END;
$f$;

CREATE TRIGGER sales_insert BEFORE INSERT ON sales
FOR EACH ROW EXECUTE PROCEDURE sales_insert();

Postgres partitioning

Good for:
  rolling off data
  DB maintenance
  queries which use the partition key
  under 300 partitions
  insert performance

Bad for:
  administration
  queries which do not use the partition key
  JOINs
  over 300 partitions
  update performance
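Two of the "good for" items can be sketched against the sales partitions defined earlier. The date range and the choice to drop the May partition are assumptions made for illustration.

```sql
-- Let the planner skip partitions whose CHECK constraint rules them out
-- ('partition' is the default value from 8.4 on).
SET constraint_exclusion = partition;

-- A query that filters on the partition key: the plan should list only
-- the parent table and sales_2011_06, not the other months.
EXPLAIN
SELECT seller_id, sum(sale_amount)
FROM sales
WHERE sell_date >= '2011-06-10'
  AND sell_date <  '2011-06-17'
GROUP BY seller_id;

-- Rolling off data: detach the old partition from the parent, archive it
-- if needed, then drop it. No large DELETE and no VACUUM afterwards.
ALTER TABLE sales_2011_05 NO INHERIT sales;
DROP TABLE sales_2011_05;
```

A query without a `sell_date` condition gets no such pruning and scans every partition, which is the "bad for" case.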

you need a data expiration policy

you can't plan your DW otherwise

sets your storage requirements
lets you project how queries will run when database is full
people don't like talking about deleting data
  will take a lot of meetings

you need a data expiration policy

raw import data              1 month
detail-level transactions    3 years
detail-level web logs        1 year
rollups                      10 years

What's a materialized view?

query results as table

calculate once, read many times
  complex/expensive queries
  frequently referenced

not necessarily a whole query
  often part of a query

manually maintained in PostgreSQL
  automagic support not complete yet

SELECT page, COUNT(*) as total_hits
FROM hit_counter
WHERE date_trunc('day', hit_date)
    BETWEEN now() - INTERVAL '7 days' AND now()
GROUP BY page
ORDER BY total_hits DESC
LIMIT 10;

CREATE TABLE page_hits (
    page TEXT,
    hit_day DATE,
    total_hits INT,
    CONSTRAINT page_hits_pk PRIMARY KEY(hit_day, page)
);

each day:

INSERT INTO page_hits
SELECT page,
    date_trunc('day', hit_date) as hit_day,
    COUNT(*) as total_hits
FROM hit_counter
WHERE date_trunc('day', hit_date)
    = date_trunc('day', now() - INTERVAL '1 day')
GROUP BY page, date_trunc('day', hit_date)
ORDER BY total_hits DESC;

SELECT page, total_hits
FROM page_hits
WHERE hit_day BETWEEN now() - INTERVAL '7 days' AND now();

maintaining matviews
BEST: update matviews at batch load time
GOOD: update matview according to clock/calendar
BAD for DW: update matviews using a trigger

matview tips

matviews should be small

1/10 to of RAM

each matview should support several queries

or one really really important one

truncate + insert, don't update
index matviews like crazy
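A sketch of the "truncate + insert" refresh for the page_hits matview, meant to run at the end of the batch load. The extra index name and column order are assumptions; which indexes pay off depends on the queries the matview serves.

```sql
-- One-time setup: an additional index for lookups by page.
-- The primary key already covers (hit_day, page).
CREATE INDEX page_hits_page_idx ON page_hits (page, hit_day);

-- Refresh, run at batch load time.
BEGIN;

-- Rebuild from scratch instead of updating rows in place.
-- TRUNCATE is transactional, so readers never see an empty matview.
TRUNCATE page_hits;

INSERT INTO page_hits (page, hit_day, total_hits)
SELECT page,
       date_trunc('day', hit_date)::date AS hit_day,
       count(*) AS total_hits
FROM hit_counter
GROUP BY page, date_trunc('day', hit_date)::date;

COMMIT;

-- Fresh statistics for the planner after the reload.
ANALYZE page_hits;
```

Rebuilding leaves no dead rows behind, which matters here because autovacuum is switched off in this configuration.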


Josh Berkus:

blog:
pgexperts:
pgCon: Ottawa: May 17-20
OpenSourceBridge: Portland: June

This talk is copyright 2010 Josh Berkus and is licensed under the Creative Commons Attribution License.
Special thanks for materials to: Elein Mustain (PL/R), Hitoshi Harada and David Fetter (windowing functions), Andrew Dunstan (file_FDW)
