
Really Big Elephants

Data Warehousing
with

PostgreSQL
Josh Berkus
MySQL User Conference 2011

Included/Excluded
I will cover:
advantages of Postgres for DW
configuration
tablespaces
ETL/ELT
windowing
partitioning
materialized views

I won't cover:
hardware selection
EAV / blobs
denormalization
DW query tuning
external DW tools
backups & upgrades

What is a data warehouse?

synonyms, etc.:
Business Intelligence (also BI/DW)
Analytics database
OnLine Analytical Processing (OLAP)
Data Mining
Decision Support

OLTP

vs

DW

OLTP:
many single-row writes
current data
queries generated by user activity
< 1s response times
0.5 to 5x RAM
100 to 1000 users
constraints

DW:
few large batch imports
years of data
queries generated by large reports
queries can run for hours
5x to 2000x RAM
1 to 10 users
no constraints

Why use PostgreSQL for data warehousing?

Complex Queries
SELECT
  CASE WHEN ((SUM(inventory.closed_on_hand) + SUM(changes.received) + SUM(changes.adjustments)
              + SUM(changes.transferred_in-changes.transferred_out)) <> 0)
       THEN ROUND((CAST(SUM(changes.sold_and_closed + changes.returned_and_closed) AS numeric) * 100)
                  / CAST(SUM(starting.closed_on_hand) + SUM(changes.received) + SUM(changes.adjustments)
                         + SUM(changes.transferred_in-changes.transferred_out) AS numeric), 5)
       ELSE 0 END AS "Percent_Sold",
  CASE WHEN (SUM(changes.sold_and_closed) <> 0)
       THEN ROUND(100*((SUM(changes.closed_markdown_units_sold)*1.0) / SUM(changes.sold_and_closed)), 5)
       ELSE 0 END AS "Percent_of_Units_Sold_with_Markdown",
  CASE WHEN (SUM(changes.sold_and_closed * _sku.retail_price) <> 0)
       THEN ROUND(100*(SUM(changes.closed_markdown_dollars_sold)*1.0)
                  / SUM(changes.sold_and_closed * _sku.retail_price), 5)
       ELSE 0 END AS "Markdown_Percent",
  '0' AS "Percent_of_Total_Sales",
  CASE WHEN SUM((changes.sold_and_closed + changes.returned_and_closed) * _sku.retail_price) IS NULL
       THEN 0
       ELSE SUM((changes.sold_and_closed + changes.returned_and_closed) * _sku.retail_price)
       END AS "Net_Sales_at_Retail",
  '0' AS "Percent_of_Ending_Inventory_at_Retail",
  SUM(inventory.closed_on_hand * _sku.retail_price) AS "Ending_Inventory_at_Retail",
  "_store"."label" AS "Store",
  "_department"."label" AS "Department",
  "_vendor"."name" AS "Vendor_Name"
FROM inventory
JOIN inventory as starting
  ON inventory.warehouse_id = starting.warehouse_id AND inventory.sku_id = starting.sku_id
LEFT OUTER JOIN (
  SELECT warehouse_id, sku_id, sum(received) as received, sum(transferred_in) as transferred_in,
         sum(transferred_out) as transferred_out, sum(adjustments) as adjustments, sum(sold) as sold
  FROM movement
  WHERE movement.movement_date BETWEEN '2010-08-05' AND '2010-08-19'
  GROUP BY sku_id, warehouse_id
) as changes
  ON inventory.warehouse_id = changes.warehouse_id AND inventory.sku_id = changes.sku_id
JOIN _sku ON _sku.id = inventory.sku_id
JOIN _warehouse ON _warehouse.id = inventory.warehouse_id
JOIN _location_hierarchy AS _store ON _store.id = _warehouse.store_id AND _store.type = 'Store'
JOIN _product ON _product.id = _sku.product_id
JOIN _merchandise_hierarchy AS _department

Complex Queries

JOIN optimization:
5 different JOIN types
approximate planning for 20+ table joins
plus nested subqueries

subqueries in any clause
windowing queries
recursive queries

Big Data Features


for big tables, big databases, big backups, big updates, big queries:

partitioning
tablespaces
PITR
binary replication
resource control

Extensibility

add data analysis functionality from external libraries inside the database:
financial analysis
genetic sequencing
approximate queries

create your own:
data types
aggregates
functions
operators

Community
"I'm running a partitioning scheme using 256 tables with a maximum of 16 million rows (namely IPv4-addresses) and a current total of about 2.5 billion rows; there are no deletes though, but lots of updates."

"I use PostgreSQL basically as a data warehouse to store all the genetic data that our lab generates."

"With this configuration I figure I'll have ~3TB for my main data tables and 1TB for indexes."

lots of experience with large databases
blogs, tools, online help

Sweet Spot
[chart: database size sweet spots (scale 0 to 30) for MySQL, PostgreSQL, and dedicated DW databases]

DW Databases

Vertica, Greenplum, Aster Data, Infobright, Teradata, Hadoop/HBase, Netezza, HadoopDB, LucidDB, MonetDB, SciDB, Paraccel


How do I configure PostgreSQL for data warehousing?

General Setup

Latest version of PostgreSQL
System with lots of drives: 6 to 48 drives, or 2 to 12 SSDs
High-throughput RAID
Write-ahead log (WAL) on separate disk(s), 10 to 50 GB of space
separate the DW workload onto its own server

Settings
few connections:
max_connections = 10 to 40

raise those memory limits!
shared_buffers = 1/8 to 1/4 of RAM
work_mem = 128MB to 1GB
maintenance_work_mem = 512MB to 1GB
temp_buffers = 128MB to 1GB
effective_cache_size = 3/4 of RAM
wal_buffers = 16MB

No autovacuum
autovacuum = off
vacuum_cost_delay = 0

do your VACUUMs and ANALYZEs as part of the batch load process (usually several of them)

also maintain tables by partitioning

(a sample configuration sketch follows)
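As a concrete illustration (not from the original slides), here is a minimal postgresql.conf sketch for a dedicated DW server, assuming roughly 32GB of RAM; scale the values to your own hardware:

max_connections = 20
shared_buffers = 4GB                  # ~1/8 of RAM
work_mem = 512MB                      # per sort/hash node; safe only with few connections
maintenance_work_mem = 1GB
temp_buffers = 256MB
effective_cache_size = 24GB           # ~3/4 of RAM
wal_buffers = 16MB
autovacuum = off                      # VACUUM/ANALYZE run from the batch load job instead
vacuum_cost_delay = 0                 # don't throttle those manual VACUUMs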

What are tablespaces?

logical data extents

lets you put some of your data on specific devices / disks

CREATE TABLESPACE history_log LOCATION '/mnt/san2/history_log';
ALTER TABLE history_log SET TABLESPACE history_log;

tablespace reasons

parallelize access:
your largest fact table on one tablespace, its indexes on another
(not as useful if you have a good SAN)

temp tablespace for temp tables
move key join tables to SSD
migrate to new storage one table at a time
(see the sketch below)
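A minimal sketch of these uses; the device paths, tablespace names, and index name are hypothetical:

CREATE TABLESPACE ssd1 LOCATION '/mnt/ssd1/pgdata';
CREATE TABLESPACE fact_indexes LOCATION '/mnt/san2/fact_indexes';

-- largest fact table stays on its own device, its index goes on another
CREATE INDEX sales_sell_date_idx ON sales (sell_date) TABLESPACE fact_indexes;

-- move a key join table to the SSD
ALTER TABLE _sku SET TABLESPACE ssd1;

-- route temporary tables and sorts to their own device
SET temp_tablespaces = 'ssd1';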

What is ETL and how do I do it?

Extract, Transform, Load

how you turn external raw data into normalized database data

Apache logs → web analytics DB
CSV POS files → financial reporting DB
OLTP server → 10-year data warehouse

also called ELT when the transformation is done inside the database

PostgreSQL is particularly good for ELT

L: INSERT

batch INSERTs into 100's or 1000's per transaction
(row-at-a-time is very slow)

create and load import tables in one transaction
add indexes and constraints after load
insert several streams in parallel (but not more than CPU cores)

(see the sketch below)
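A minimal sketch of batched loading; the table, columns, and values are hypothetical:

BEGIN;
CREATE TABLE import_weblog ( hit_time TIMESTAMPTZ, page TEXT, bytes INT );
-- hundreds to thousands of rows per INSERT and per transaction, not one row at a time
INSERT INTO import_weblog VALUES
  ('2011-06-05 10:00:01', '/index.html', 5123),
  ('2011-06-05 10:00:02', '/about.html', 2048),
  ('2011-06-05 10:00:03', '/index.html', 5123);
COMMIT;
-- indexes and constraints only after the load finishes
CREATE INDEX import_weblog_time ON import_weblog (hit_time);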

L: COPY

Powerful, efficient delimited file loader:
almost bug-free - we use it for backup
3-5X faster than inserts
works with most delimited files

but:
not fault-tolerant
have to know the structure in advance
(try pg_loader for better COPY)

L: COPY
COPY weblog_new FROM '/mnt/transfers/weblogs/weblog20110605.csv' with csv;

COPY traffic_snapshot FROM 'traffic_20110605192241' delimiter '|' null as 'N';

\copy weblog_summary_june TO 'Desktop/weblog-june2011.csv' with csv header

L: in 9.1: FDW
CREATE FOREIGN TABLE raw_hits (
    hit_time TIMESTAMP,
    page TEXT
) SERVER file_fdw
  OPTIONS (format 'csv', delimiter ';', filename '/var/log/hits.log');
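The foreign table above assumes the file_fdw extension is installed and a foreign server named file_fdw already exists; a minimal 9.1 setup sketch would be:

CREATE EXTENSION file_fdw;
CREATE SERVER file_fdw FOREIGN DATA WRAPPER file_fdw;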

L: in 9.1: FDW
CREATE TABLE hits_2011041617 AS
SELECT page, count(*)
FROM raw_hits
WHERE hit_time > '2011-04-16 16:00:00'
  AND hit_time <= '2011-04-16 17:00:00'
GROUP BY page;

T: temporary tables
CREATE TEMPORARY TABLE sales_records_june_rollup
ON COMMIT DROP AS
SELECT seller_id, location, sell_date,
       sum(sale_amount), array_agg(item_id)
FROM raw_sales
WHERE sell_date BETWEEN '2011-06-01' AND '2011-06-30 23:59:59.999'
GROUP BY seller_id, location, sell_date;

in 9.1: unlogged tables

like MyISAM, without the risk

CREATE UNLOGGED TABLE cleaned_log_import AS
SELECT hit_time, page
FROM raw_hits, hit_watermark
WHERE hit_time > last_watermark
  AND is_valid(page);

T: stored procedures

multiple languages:
SQL, PL/pgSQL, PL/Perl, PL/Python, PL/PHP, PL/R, PL/Java

allows you to use external data processing libraries in the database

custom aggregates, operators, more

CREATE OR REPLACE FUNCTION normalize_query ( queryin text )
RETURNS TEXT LANGUAGE PLPERL STABLE STRICT AS $f$
  # this function "normalizes" queries by stripping out constants.
  # some regexes by Guillaume Smet under The PostgreSQL License.
  local $_ = $_[0];

  # first clean up the whitespace
  s/\s+/ /g;
  s/\s,/,/g;
  s/,(\S)/, $1/g;
  s/^\s//g;
  s/\s$//g;

  # remove any double quotes and quoted text
  s/\\'//g;
  s/'[^']*'/''/g;
  s/''('')+/''/g;

  # remove TRUE and FALSE
  s/(\W)TRUE(\W)/$1BOOL$2/gi;
  s/(\W)FALSE(\W)/$1BOOL$2/gi;

  # remove any bare numbers or hex numbers
  s/([^a-zA-Z_\$-])-?([0-9]+)/${1}0/g;
  s/([^a-z_\$-])0x[0-9a-f]{1,10}/${1}0x/ig;

  # normalize any IN statements
  s/(IN\s*)\([\'0x,\s]*\)/${1}(...)/ig;

  # return the normalized query
  return $_;
$f$;

CREATE OR REPLACE FUNCTION f_graph2() RETURNS text AS '
  sql <- paste("SELECT id as x,hit as y FROM mytemp LIMIT 30",sep="");
  str <- c(pg.spi.exec(sql));
  mymain <- "Graph 2";
  mysub <- paste("The worst offender is: ",str[1,3]," with ",str[1,2]," hits",sep="");
  myxlab <- "Top 30 IP Addresses";
  myylab <- "Number of Hits";
  pdf(''/tmp/graph2.pdf'');
  plot(str,type="b",main=mymain,sub=mysub,xlab=myxlab,ylab=myylab,lwd=3);
  mtext("Probes by intrusive IP Addresses",side=3);
  dev.off();
  print(''DONE'');
' LANGUAGE plr;

ELT Tips

bulk insert into a new table instead of updating/deleting an existing table
update all columns in one operation instead of one at a time (see the sketch below)
use views and custom functions to simplify your queries
inserting into your long-term tables should be the very last step (no updates after!)
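As a sketch of the "one pass, all columns" idea (the target table and cleanup expressions are hypothetical), do the cleanup as one bulk insert into a new table rather than several UPDATEs against the old one:

CREATE TABLE clean_sales AS
SELECT seller_id,
       upper(trim(location)) AS location,   -- every column fix in a single pass
       sell_date::date AS sell_date,
       round(sale_amount, 2) AS sale_amount
FROM raw_sales
WHERE sale_amount IS NOT NULL;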

What's a windowing query?

regular aggregate

windowing function

CREATE TABLE events (
    event_id INT,
    event_type TEXT,
    start TIMESTAMPTZ,
    duration INTERVAL,
    event_desc TEXT
);

SELECT MAX(concurrent)
FROM (
  SELECT SUM(tally) OVER (ORDER BY start) AS concurrent
  FROM (
    SELECT start, 1::INT as tally FROM events
    UNION ALL
    SELECT (start + duration), -1 FROM events
  ) AS event_vert
) AS ec;
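For contrast with the "regular aggregate" label above: a plain aggregate collapses rows into one result per group, while the windowing version keeps every row and attaches a running value to it. A minimal sketch against the same events table (my example, not from the original slides):

SELECT date_trunc('day', start) AS event_day, count(*) AS events_started
FROM events
GROUP BY date_trunc('day', start);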

UPDATE partition_name
SET drop_month = dropit
FROM (
  SELECT round_id,
    CASE WHEN ( ( row_number() over (partition by team_id order by team_id, total_points) ) <= ( drop_lowest ) )
         THEN 0 ELSE 1 END as dropit
  FROM (
    SELECT team.team_id, round.round_id, month_points as total_points,
      row_number() OVER ( partition by team.team_id, kal.positions
                          order by team.team_id, kal.positions, month_points desc ) as ordinal,
      at_least, numdrop as drop_lowest
    FROM partition_name as rdrop
    JOIN round USING (round_id)
    JOIN team USING (team_id)
    JOIN pick ON round.round_id = pick.round_id and pick.pick_period @> this_period
    LEFT OUTER JOIN keep_at_least kal
      ON rdrop.pool_id = kal.pool_id and pick.position_id = any ( kal.positions )
    WHERE rdrop.pool_id = this_pool AND team.team_id = this_team
  ) as ranking
  WHERE ordinal > at_least or at_least is null
) as droplow
WHERE droplow.round_id = partition_name.round_id
  AND partition_name.pool_id = this_pool
  AND dropit = 0;

SELECT round_id,
  CASE WHEN ( ( row_number() OVER (partition by team_id order by team_id, total_points) ) <= ( drop_lowest ) )
       THEN 0 ELSE 1 END as dropit
FROM (
  SELECT team.team_id, round.round_id, month_points as total_points,
    row_number() OVER ( partition by team.team_id, kal.positions
                        order by team.team_id, kal.positions, month_points desc ) as ordinal

stream processing SQL

replace multiple queries with a single query

avoid scanning large tables multiple times, and avoid transmitting MBs of data (for some data mining tasks)

replace pages of application code

SQL alternative to map/reduce

How do I partition my tables?

Postgres partitioning

based on table inheritance and constraint exclusion


partitions are also full tables
explicit constraints define the range of the partition
triggers or RULEs handle insert/update

CREATE TABLE sales (
    sell_date TIMESTAMPTZ NOT NULL,
    seller_id INT NOT NULL,
    item_id INT NOT NULL,
    sale_amount NUMERIC NOT NULL,
    narrative TEXT
);

CREATE TABLE sales_2011_06 (
    CONSTRAINT partition_date_range
    CHECK (sell_date >= '2011-06-01' AND sell_date < '2011-07-01')
) INHERITS ( sales );

CREATE FUNCTION sales_insert () RETURNS trigger
LANGUAGE plpgsql AS $f$
BEGIN
  CASE
    WHEN NEW.sell_date < '2011-06-01' THEN
      INSERT INTO sales_2011_05 VALUES (NEW.*);
    WHEN NEW.sell_date < '2011-07-01' THEN
      INSERT INTO sales_2011_06 VALUES (NEW.*);
    WHEN NEW.sell_date >= '2011-07-01' THEN
      INSERT INTO sales_2011_07 VALUES (NEW.*);
    ELSE
      INSERT INTO sales_overflow VALUES (NEW.*);
  END CASE;
  RETURN NULL;
END;
$f$;

CREATE TRIGGER sales_insert BEFORE INSERT ON sales
FOR EACH ROW EXECUTE PROCEDURE sales_insert();

Postgres partitioning

Good for:
rolling off data (see the sketch below)
DB maintenance
queries which use the partition key
under 300 partitions
insert performance

Bad for:
administration
queries which do not use the partition key
JOINs
over 300 partitions
update performance
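A minimal sketch of the "rolling off data" and partition-key cases, using the sales partitions from the previous slides (the expired month name is hypothetical):

-- expiring a month is a fast metadata operation, not a long DELETE plus VACUUM
DROP TABLE sales_2010_06;

-- constraint exclusion prunes this query to sales_2011_06 only
SELECT sum(sale_amount)
FROM sales
WHERE sell_date >= '2011-06-01' AND sell_date < '2011-07-01';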

you need a data expiration policy

you can't plan your DW otherwise


sets your storage requirements
lets you project how queries will run when the database is full
people don't like talking about deleting data (will take a lot of meetings)

you need a data expiration policy


raw import data: 1 month
detail-level transactions: 3 years
detail-level web logs: 1 year
rollups: 10 years

What's a materialized view?

query results as table

calculate once, read many times

best for:
complex/expensive queries
frequently referenced
often part of a query (not necessarily a whole query)

automagic support not complete yet
manually maintained in PostgreSQL

SELECT page, COUNT(*) as total_hits
FROM hit_counter
WHERE date_trunc('day', hit_date) BETWEEN now() - INTERVAL '7 days' AND now()
GROUP BY page
ORDER BY total_hits DESC
LIMIT 10;

CREATE TABLE page_hits (
    page TEXT,
    hit_day DATE,
    total_hits INT,
    CONSTRAINT page_hits_pk PRIMARY KEY (hit_day, page)
);

each day:

INSERT INTO page_hits
SELECT page, date_trunc('day', hit_date) as hit_day, COUNT(*) as total_hits
FROM hit_counter
WHERE date_trunc('day', hit_date) = date_trunc('day', now() - INTERVAL '1 day')
GROUP BY page, hit_day;

SELECT page, total_hits
FROM page_hits
WHERE hit_day BETWEEN now() - INTERVAL '7 days' AND now();

maintaining matviews
BEST: update matviews at batch load time
GOOD: update matviews according to clock/calendar
BAD for DW: update matviews using a trigger

matview tips

matviews should be small (1/10 to of RAM)
each matview should support several queries (or one really really important one)
truncate + insert, don't update (see the sketch below)
index matviews like crazy
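One way to apply the truncate + insert tip to the page_hits matview above, run as part of the batch load (a sketch; for very large source tables the daily incremental insert shown earlier may be preferable to a full rebuild):

BEGIN;
TRUNCATE page_hits;
INSERT INTO page_hits
SELECT page, date_trunc('day', hit_date) AS hit_day, COUNT(*) AS total_hits
FROM hit_counter
GROUP BY page, hit_day;
COMMIT;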

Contact

Josh Berkus: josh@pgexperts.com

blog: blogs.ittoolbox.com/database/soup
pgexperts: www.pgexperts.com
PostgreSQL: www.postgresql.org

Upcoming Events:
pgCon: Ottawa: May 17-20
OpenSourceBridge: Portland: June

This talk is copyright 2010 Josh Berkus and is licensed under the Creative Commons Attribution License. Special thanks for materials to: Elein Mustain (PL/R), Hitoshi Harada and David Fetter (windowing functions), Andrew Dunstan (file_fdw).
