Really Big Elephants: Data Warehousing Postgresql
Data Warehousing
with
PostgreSQL
Josh Berkus MySQL User Conference 2011
Included/Excluded
I will cover:
- advantages of Postgres for DW
- configuration
- tablespaces
- ETL/ELT
- windowing
- partitioning
- materialized views
I won't cover:
- hardware selection
- EAV / blobs
- denormalization
- DW query tuning
- external DW tools
- backups & upgrades
Business Intelligence
synonyms, etc.:
- BI/DW
- Analytics database
- OnLine Analytical Processing (OLAP)
- Data Mining
- Decision Support
OLTP vs DW
OLTP:
- many single-row writes
- current data
- queries generated by user activity
- < 1s response times
- 0.5 to 5x RAM
DW:
- few large batch imports
- years of data
- queries generated by large reports
- queries can run for hours
- 5x to 2000x RAM
OLTP vs DW
- 1 to 10 users
- no constraints
Complex Queries
SELECT
  CASE WHEN ((SUM(inventory.closed_on_hand) + SUM(changes.received) + SUM(changes.adjustments)
        + SUM(changes.transferred_in - changes.transferred_out)) <> 0)
    THEN ROUND((CAST(SUM(changes.sold_and_closed + changes.returned_and_closed) AS numeric) * 100)
        / CAST(SUM(starting.closed_on_hand) + SUM(changes.received) + SUM(changes.adjustments)
        + SUM(changes.transferred_in - changes.transferred_out) AS numeric), 5)
    ELSE 0 END AS "Percent_Sold",
  CASE WHEN (SUM(changes.sold_and_closed) <> 0)
    THEN ROUND(100 * ((SUM(changes.closed_markdown_units_sold) * 1.0)
        / SUM(changes.sold_and_closed)), 5)
    ELSE 0 END AS "Percent_of_Units_Sold_with_Markdown",
  CASE WHEN (SUM(changes.sold_and_closed * _sku.retail_price) <> 0)
    THEN ROUND(100 * (SUM(changes.closed_markdown_dollars_sold) * 1.0)
        / SUM(changes.sold_and_closed * _sku.retail_price), 5)
    ELSE 0 END AS "Markdown_Percent",
  '0' AS "Percent_of_Total_Sales",
  CASE WHEN SUM((changes.sold_and_closed + changes.returned_and_closed) * _sku.retail_price) IS NULL
    THEN 0
    ELSE SUM((changes.sold_and_closed + changes.returned_and_closed) * _sku.retail_price)
    END AS "Net_Sales_at_Retail",
  '0' AS "Percent_of_Ending_Inventory_at_Retail",
  SUM(inventory.closed_on_hand * _sku.retail_price) AS "Ending_Inventory_at_Retail",
  "_store"."label" AS "Store",
  "_department"."label" AS "Department",
  "_vendor"."name" AS "Vendor_Name"
FROM inventory
JOIN inventory AS starting
  ON inventory.warehouse_id = starting.warehouse_id AND inventory.sku_id = starting.sku_id
LEFT OUTER JOIN (
  SELECT warehouse_id, sku_id, sum(received) AS received,
    sum(transferred_in) AS transferred_in, sum(transferred_out) AS transferred_out,
    sum(adjustments) AS adjustments, sum(sold) AS sold
  FROM movement
  WHERE movement.movement_date BETWEEN '2010-08-05' AND '2010-08-19'
  GROUP BY sku_id, warehouse_id
) AS changes
  ON inventory.warehouse_id = changes.warehouse_id AND inventory.sku_id = changes.sku_id
JOIN _sku ON _sku.id = inventory.sku_id
JOIN _warehouse ON _warehouse.id = inventory.warehouse_id
JOIN _location_hierarchy AS _store ON _store.id = _warehouse.store_id AND _store.type = 'Store'
JOIN _product ON _product.id = _sku.product_id
JOIN _merchandise_hierarchy AS _department
Complex Queries
JOIN optimization:
- 5 different JOIN types
- approximate planning for 20+ table joins, plus nested subqueries
- big tables, big databases, big backups, big updates, big queries
Extensibility
add data analysis functionality from external libraries inside the database:
- financial analysis
- genetic sequencing
- approximate queries
via custom data types, aggregates, functions, and operators
Community
"I'm running a partitioning scheme using 256 tables with a maximum of 16 million rows (namely IPv4 addresses) and a current total of about 2.5 billion rows; there are no deletes, though, but lots of updates."
"I use PostgreSQL basically as a data warehouse to store all the genetic data that our lab generates."
"With this configuration I figure I'll have ~3TB for my main data tables and 1TB for indexes."
Sweet Spot
[chart: relative database-size sweet spots, on a 0-30 scale, for MySQL, PostgreSQL, and dedicated DW databases]
General Setup
6 to 48 drives, or 2 to 12 SSDs
Settings
- few connections: max_connections = 10 to 40
- no autovacuum: autovacuum = off, vacuum_cost_delay = off
CREATE TABLESPACE history_log LOCATION '/mnt/san2/history_log';
ALTER TABLE history_log SET TABLESPACE history_log;
tablespace reasons
- parallelize access
- temp tablespace for temp tables
- move key join tables to SSD
- migrate to new storage one table at a time
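The temp-tablespace trick, for example, might look like this (the tablespace name and path are illustrative, not from the talk; temp_tablespaces is a standard PostgreSQL setting):

```sql
-- create a tablespace on fast scratch storage
CREATE TABLESPACE scratch LOCATION '/mnt/ssd1/pgtemp';

-- send temporary tables and sort spill files there
SET temp_tablespaces = 'scratch';
```

This can also be set per-user or in postgresql.conf so all ETL sessions pick it up automatically.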
ETL
how you turn external raw data into normalized database data:
- Apache logs → web analytics DB
- CSV POS files → financial reporting DB
- OLTP server → 10-year data warehouse
also called ELT when the transformation is done inside the database
L: INSERT
- create and load import tables in one transaction
- add indexes and constraints after load
- insert several streams in parallel
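A sketch of that load pattern (table and column names invented for illustration):

```sql
BEGIN;
-- create and populate the import table in one transaction,
-- so an aborted load leaves nothing behind
CREATE TABLE weblog_import ( hit_time timestamptz, page text );
INSERT INTO weblog_import VALUES ('2011-06-05 00:00:01', '/index.html');
-- ... many more inserts, possibly from several parallel sessions ...
COMMIT;

-- indexes and constraints go on only after the data is loaded
CREATE INDEX weblog_import_time ON weblog_import (hit_time);
ALTER TABLE weblog_import ALTER COLUMN page SET NOT NULL;
```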
L: COPY
almost bug-free - we use it for backup 3-5X faster than inserts works with most delimited files also have to know structure in advance try pg_loader for better COPY
Not fault-tolerant
L: COPY
COPY weblog_new FROM '/mnt/transfers/weblogs/weblog20110605.csv' WITH csv;
COPY traffic_snapshot FROM 'traffic_20110605192241' DELIMITER '|' NULL AS 'N';
\copy weblog_summary_june TO 'Desktop/weblog-june2011.csv' with csv header
L: in 9.1: FDW
CREATE FOREIGN TABLE raw_hits (
  hit_time TIMESTAMP,
  page TEXT
) SERVER file_fdw
OPTIONS (format 'csv', delimiter ';', filename '/var/log/hits.log');
L: in 9.1: FDW
CREATE TABLE hits_2011041617 AS
SELECT page, count(*)
FROM raw_hits
WHERE hit_time > '2011-04-16 16:00:00'
  AND hit_time <= '2011-04-16 17:00:00'
GROUP BY page;
T: temporary tables
CREATE TEMPORARY TABLE sales_records_june_rollup
ON COMMIT DROP AS
SELECT seller_id, location, sell_date,
  sum(sale_amount), array_agg(item_id)
FROM raw_sales
WHERE sell_date BETWEEN '2011-06-01' AND '2011-06-30 23:59:59.999'
GROUP BY seller_id, location, sell_date;
T: unlogged tables
CREATE UNLOGGED TABLE cleaned_log_import AS
SELECT hit_time, page
FROM raw_hits, hit_watermark
WHERE hit_time > last_watermark
  AND is_valid(page);
T: stored procedures
multiple languages: SQL, PL/pgSQL, PL/Perl, PL/Python, PL/PHP, PL/R, PL/Java
allows you to use external data processing libraries in the database
CREATE OR REPLACE FUNCTION normalize_query ( queryin text )
RETURNS TEXT LANGUAGE plperl STABLE STRICT AS $f$
  # this function "normalizes" queries by stripping out constants.
  # some regexes by Guillaume Smet under The PostgreSQL License.
  local $_ = $_[0];
  # first clean up the whitespace
  s/\s+/ /g;
  s/\s,/,/g;
  s/,(\S)/, $1/g;
  s/^\s//g;
  s/\s$//g;
  # remove any double quotes and quoted text
  s/\\'//g;
  s/'[^']*'/''/g;
  s/''('')+/''/g;
  # remove TRUE and FALSE
  s/(\W)TRUE(\W)/$1BOOL$2/gi;
  s/(\W)FALSE(\W)/$1BOOL$2/gi;
  # remove any bare numbers or hex numbers
  s/([^a-zA-Z_\$-])-?([0-9]+)/${1}0/g;
  s/([^a-z_\$-])0x[0-9a-f]{1,10}/${1}0x/ig;
  # normalize any IN statements
  s/(IN\s*)\([\'0x,\s]*\)/${1}(...)/ig;
  # return the normalized query
  return $_;
$f$;
CREATE OR REPLACE FUNCTION f_graph2() RETURNS text AS '
  sql <- paste("SELECT id as x,hit as y FROM mytemp LIMIT 30",sep="");
  str <- c(pg.spi.exec(sql));
  mymain <- "Graph 2";
  mysub <- paste("The worst offender is: ",str[1,3]," with ",str[1,2]," hits",sep="");
  myxlab <- "Top 30 IP Addresses";
  myylab <- "Number of Hits";
  pdf(''/tmp/graph2.pdf'');
  plot(str,type="b",main=mymain,sub=mysub,xlab=myxlab,ylab=myylab,lwd=3);
  mtext("Probes by intrusive IP Addresses",side=3);
  dev.off();
  print(''DONE'');
' LANGUAGE plr;
ELT Tips
- bulk insert into a new table instead of updating/deleting an existing table
- update all columns in one operation instead of one at a time
- use views and custom functions to simplify your queries
- inserting into your long-term tables should be the very last step: no updates after!
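The first tip, sketched with made-up staging table and column names: instead of updating millions of rows in place, rewrite the cleaned data into a fresh table in one pass:

```sql
-- slow for DW volumes: in-place updates on a big staging table
-- UPDATE staging_sales SET amount = amount * fx_rate WHERE currency <> 'USD';

-- faster: bulk-rewrite into a new table, transforming as you go
CREATE TABLE staging_sales_clean AS
SELECT sale_id,
       CASE WHEN currency <> 'USD'
            THEN amount * fx_rate
            ELSE amount END AS amount_usd,
       sell_date
FROM staging_sales;
```

The old staging table can then simply be dropped, which is far cheaper than vacuuming away dead rows left by a mass UPDATE.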
regular aggregate
windowing function
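The difference in a nutshell, using the sales table defined in the partitioning example later in this talk:

```sql
-- regular aggregate: collapses each group to a single row
SELECT seller_id, sum(sale_amount) AS total
FROM sales
GROUP BY seller_id;

-- windowing function: the same aggregate computed over a "window",
-- but every input row is kept
SELECT seller_id, sale_amount,
       sum(sale_amount) OVER (PARTITION BY seller_id) AS seller_total
FROM sales;
```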
CREATE TABLE events (
  event_id INT,
  event_type TEXT,
  start TIMESTAMPTZ,
  duration INTERVAL,
  event_desc TEXT
);
SELECT MAX(concurrent) FROM (
  SELECT SUM(tally) OVER (ORDER BY start) AS concurrent
  FROM (
    SELECT start, 1::INT AS tally FROM events
    UNION ALL
    SELECT (start + duration), -1 FROM events
  ) AS event_vert
) AS ec;
UPDATE partition_name SET drop_month = dropit
FROM (
  SELECT round_id,
    CASE WHEN ( ( row_number() OVER (PARTITION BY team_id ORDER BY team_id, total_points) )
                <= ( drop_lowest ) )
      THEN 0 ELSE 1 END AS dropit
  FROM (
    SELECT team.team_id, round.round_id, month_points AS total_points,
      row_number() OVER ( PARTITION BY team.team_id, kal.positions
                          ORDER BY team.team_id, kal.positions, month_points DESC ) AS ordinal,
      at_least, numdrop AS drop_lowest
    FROM partition_name AS rdrop
    JOIN round USING (round_id)
    JOIN team USING (team_id)
    JOIN pick ON round.round_id = pick.round_id AND pick.pick_period @> this_period
    LEFT OUTER JOIN keep_at_least kal
      ON rdrop.pool_id = kal.pool_id AND pick.position_id = ANY ( kal.positions )
    WHERE rdrop.pool_id = this_pool AND team.team_id = this_team
  ) AS ranking
  WHERE ordinal > at_least OR at_least IS NULL
) AS droplow
WHERE droplow.round_id = partition_name.round_id
  AND partition_name.pool_id = this_pool
  AND dropit = 0;
SELECT round_id,
  CASE WHEN ( ( row_number() OVER (PARTITION BY team_id ORDER BY team_id, total_points) )
              <= ( drop_lowest ) )
    THEN 0 ELSE 1 END AS dropit
FROM (
  SELECT team.team_id, round.round_id, month_points AS total_points,
    row_number() OVER ( PARTITION BY team.team_id, kal.positions
                        ORDER BY team.team_id, kal.positions, month_points DESC ) AS ordinal
windowing functions avoid scanning large tables multiple times, and avoid transmitting MBs of data to the client (for some data-mining tasks)
Postgres partitioning
- partitions are also full tables
- explicit constraints define the range of the partition
- triggers or RULEs handle insert/update
CREATE TABLE sales (
  sell_date TIMESTAMPTZ NOT NULL,
  seller_id INT NOT NULL,
  item_id INT NOT NULL,
  sale_amount NUMERIC NOT NULL,
  narrative TEXT
);
CREATE TABLE sales_2011_06 (
  CONSTRAINT partition_date_range
    CHECK (sell_date >= '2011-06-01' AND sell_date < '2011-07-01')
) INHERITS ( sales );
CREATE FUNCTION sales_insert () RETURNS trigger LANGUAGE plpgsql AS $f$
BEGIN
  CASE
    WHEN NEW.sell_date < '2011-06-01' THEN
      INSERT INTO sales_2011_05 VALUES (NEW.*);
    WHEN NEW.sell_date < '2011-07-01' THEN
      INSERT INTO sales_2011_06 VALUES (NEW.*);
    WHEN NEW.sell_date < '2011-08-01' THEN
      INSERT INTO sales_2011_07 VALUES (NEW.*);
    ELSE
      INSERT INTO sales_overflow VALUES (NEW.*);
  END CASE;
  RETURN NULL;
END;
$f$;

CREATE TRIGGER sales_insert BEFORE INSERT ON sales
  FOR EACH ROW EXECUTE PROCEDURE sales_insert();
Postgres partitioning
Good for:
- rolling off data
- DB maintenance
- queries which use the partition key
- under 300 partitions
- insert performance
Bad for:
- administration
- queries which do not use the partition key
- JOINs
- over 300 partitions
- update performance
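For instance, a query constrained on the partition key (using the sales tables from the earlier example) lets the planner exclude every partition whose CHECK constraint cannot match, so only one child table is scanned:

```sql
-- touches only sales_2011_06, thanks to its CHECK constraint
SELECT sum(sale_amount)
FROM sales
WHERE sell_date >= '2011-06-01'
  AND sell_date < '2011-07-01';
```

Run it under EXPLAIN to confirm that the other partitions are pruned; a query without the sell_date filter would scan all of them.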
deciding a data retention policy:
- sets your storage requirements
- lets you project how queries will run when the database is full
- but people don't like talking about deleting data
matviews
- for complex/expensive queries which are frequently referenced, often as part of a larger query
- automagic support not complete yet
SELECT page, COUNT(*) AS total_hits
FROM hit_counter
WHERE hit_date BETWEEN now() - INTERVAL '7 days' AND now()
GROUP BY page
ORDER BY total_hits DESC
LIMIT 10;
CREATE TABLE page_hits (
  page TEXT,
  hit_day DATE,
  total_hits INT,
  CONSTRAINT page_hits_pk PRIMARY KEY (hit_day, page)
);
each day:
INSERT INTO page_hits
SELECT page, date_trunc('day', hit_date) AS hit_day, COUNT(*) AS total_hits
FROM hit_counter
WHERE date_trunc('day', hit_date) = date_trunc('day', now() - INTERVAL '1 day')
GROUP BY page, date_trunc('day', hit_date);
SELECT page, total_hits
FROM page_hits
WHERE hit_day BETWEEN now() - INTERVAL '7 days' AND now();
maintaining matviews
- BEST: update matviews at batch load time
- GOOD: update matviews according to clock/calendar
- BAD for DW: update matviews using a trigger
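Updating at batch load time can be as simple as rebuilding the affected slice of the summary inside the load transaction. A sketch, reusing the page_hits example (the COPY file path is invented; 2011-era Postgres has no built-in matview refresh, so this is done by hand):

```sql
BEGIN;
-- bulk-load the day's raw hits first
COPY hit_counter FROM '/mnt/transfers/hits_20110605.csv' WITH csv;

-- then rebuild yesterday's slice of the matview in the same transaction
DELETE FROM page_hits
WHERE hit_day = date_trunc('day', now() - INTERVAL '1 day');

INSERT INTO page_hits
SELECT page, date_trunc('day', hit_date), count(*)
FROM hit_counter
WHERE date_trunc('day', hit_date) = date_trunc('day', now() - INTERVAL '1 day')
GROUP BY page, date_trunc('day', hit_date);
COMMIT;
```

Because the delete and re-insert happen in one transaction, report queries never see a half-updated summary.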
matview tips
1/10 to of RAM
Contact
- blog: blogs.ittoolbox.com/database/soup
- pgexperts: www.pgexperts.com
- PostgreSQL: www.postgresql.org

Upcoming Events
- pgCon: Ottawa: May 17-20
- OpenSourceBridge: Portland: June

This talk is copyright 2010 Josh Berkus and is licensed under the Creative Commons Attribution License. Special thanks for materials to: Elein Mustain (PL/R), Hitoshi Harada and David Fetter (windowing functions), Andrew Dunstan (file_fdw)