
Exploiting MySQL 5.1
For Advanced Business Intelligence Applications
Matt Casters: Chief Architect, Data Integration and
Kettle Project Founder

MySQL User Conference, Tuesday April 24, 2007
Agenda
Great news
Pentaho Data Integration introduction
Pentaho Data Integration MySQL support
Table partitioning
Database partitioning
Clustering
Metadata to the rescue
Pentaho company overview & platform demo
Questions & closing
Great news

The San Francisco Giants beat the Arizona Diamondbacks last
Saturday
Barry Zito pitched a great game
The hot-dogs were great
Pentaho news

Mondrian 2.3 was released a few weeks ago
Detailed cache control (clearing)

Sun Microsystems & Pentaho announced that Pentaho Reporting
will be used to drive Business Intelligence requirements in Open
Office 2.3: to be released in the second half of this year.
Tomorrow we’ll make another announcement.
Pentaho Data Integration introduction

From source systems …
to the data warehouse …
to reports …
to analyses …
to dashboard reports …
to better information
Data Transformation and Integration
Examples
Data filtering
Is not null, greater than, less than, includes
Field manipulation
Trimming, padding, upper and lowercase conversion
Data calculations
+, −, ×, /, average, absolute value, arctangent, natural logarithm
Date manipulation
First day of month, Last day of month, add months, week of year,
day of year
Data type conversion
String to number, number to string, date to number
Merging fields & splitting fields
Looking up data
Look up in a database, in a text file, an Excel sheet, …
Pentaho Data Integration (Kettle)
Components
Spoon
Connect to data sources
Define transformation rules and design target schema(s)
Graphical job execution workflow engine for defining multi-stage and
conditional transformation jobs
Pan
Command-line execution of single, pre-defined transformation jobs
Kitchen
Scheduler for multi-stage jobs
Carte
Remote execution “slave” server with embedded web server
Pentaho BI Platform
Integrated scheduling of transformations or jobs
Ability to call real-time transformations and use output in reports
and dashboards
Data Transformation and Integration
Examples
Demo time: load 1M rows into a database table
Case Study: Pentaho Data Integration

Organization: Flemish Government Traffic Centre
Use case: Monitoring the state of the road network
Application requirement: Integrate minute-by-minute data from
570 highway locations for analysis (1550 lanes)
Technical challenges: Large volume of data, more than 2.5
billion rows
Business Usage: Users can now compare traffic speeds based on
weather conditions, time of day, date, season
Best practices:
Clearly understand business user requirements first
There are often multiple ways to solve data integration problems, so
consider the long-term need when choosing the right way
Summary and Resources
Pentaho and MySQL can help you manage your data
infrastructure
Extraction, Transformation and Loading for Data Warehousing and Data
Migration

kettle.pentaho.org
Kettle project homepage

kettle.javaforge.com
Kettle community website: forum, source, documentation, tech tips, samples,

www.pentaho.org/download/
All Pentaho modules, pre-configured with sample data
Developer forums, documentation
Ventana Research Open Source BI Survey

www.mysql.com
White paper -
http://dev.mysql.com/tech-resources/articles/mysql_5.0_pentaho.html
Kettle Webinar -
http://www.mysql.com/news-and-events/on-demand-webinars/pentaho-2006-09-1
Pentaho Data Integration MySQL support

Reading: MySQL Result Streaming (cursor emulation) support
Writing: MySQL dialects for data types
Job entry: Bulk Loader of text files for MySQL
Job entry: Bulk writer to a text file for MySQL
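The two bulk job entries above map onto MySQL’s native bulk statements. A minimal sketch, assuming a hypothetical tab-separated file /tmp/sales.txt and the sales table used later in this talk:

-- bulk-load a text file into a table
LOAD DATA LOCAL INFILE '/tmp/sales.txt'
INTO TABLE sales
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';

-- bulk-write a table out to a text file (file path is illustrative)
SELECT * INTO OUTFILE '/tmp/sales_out.txt'
FIELDS TERMINATED BY '\t'
FROM sales;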
Table partitioning

Challenges
More data is being gathered all the time
Data is coming from more sources than ever
Faster access to stored information is becoming more important
More people require concurrent access to the data

The old solution
Split up the data into different tables
SALES_2003
SALES_2004

SALES_2006
Hardcode reports to use one or the other table
Lots of database management issues
High report maintenance costs
Nothing is automatic, everything is manual
Table partitioning

The new solution:
MySQL 5.1 partitioned tables
Automatic split up of data into smaller chunks
SQL engine that can automatically pull results back together
Table partitioning

[Diagram: one logical Sales table, split into Year 2003, Year 2004,
Year 2005 and Year 2006 partitions]
Table partitioning

Advantages
Reduces query time by reducing the amount of data to “plough”
through.
Increases performance by
“Pruning” the list of partitions to search for automatically. This is
done using the MySQL query optimizer based on the query that’s
being issued.
Massive reduction in I/O
Smaller partitioned indexes leading to faster index tree traversal
Allowing parallel access to the different partitions
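Pruning can be checked with EXPLAIN PARTITIONS (available in MySQL 5.1). A sketch against the sales table defined on the next slide:

EXPLAIN PARTITIONS
SELECT COUNT(*)
FROM sales
WHERE sales_date < '2001-01-01';

-- the "partitions" column of the plan should list only P2000,
-- showing the optimizer skipped the other partitions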
Table partitioning

How do I create a partitioned table?

CREATE TABLE sales
(
sales_date TIMESTAMP,
ordernr INT,
customernr INT,
productnr INT,
discount_pct TINYINT
)
PARTITION BY RANGE( YEAR(sales_date) )
(
PARTITION P2000 VALUES LESS THAN (2001),
PARTITION P2001 VALUES LESS THAN (2002)
)
Table partitioning

How do I add another partition to the table?

ALTER TABLE sales
ADD PARTITION
(
PARTITION P2002 VALUES LESS THAN (2003)
)
Table partitioning

How do I drop a partition from the table?

ALTER TABLE sales
DROP PARTITION P2002
;
Table partitioning
Retrieving partition information
mysql> desc information_schema.partitions;
+-------------------------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------------------------+--------------+------+-----+---------+-------+
| TABLE_CATALOG | varchar(512) | YES | | NULL | |
| TABLE_SCHEMA | varchar(64) | NO | | | |
| TABLE_NAME | varchar(64) | NO | | | |
| PARTITION_NAME | varchar(64) | YES | | NULL | |
| SUBPARTITION_NAME | varchar(64) | YES | | NULL | |
| PARTITION_ORDINAL_POSITION | bigint(21) | YES | | NULL | |
| SUBPARTITION_ORDINAL_POSITION | bigint(21) | YES | | NULL | |
| PARTITION_METHOD | varchar(12) | YES | | NULL | |
| SUBPARTITION_METHOD | varchar(12) | YES | | NULL | |
| PARTITION_EXPRESSION | longtext | YES | | NULL | |
| SUBPARTITION_EXPRESSION | longtext | YES | | NULL | |
| PARTITION_DESCRIPTION | longtext | YES | | NULL | |
| TABLE_ROWS | bigint(21) | NO | | 0 | |
| AVG_ROW_LENGTH | bigint(21) | NO | | 0 | |
| DATA_LENGTH | bigint(21) | NO | | 0 | |
| MAX_DATA_LENGTH | bigint(21) | YES | | NULL | |
| INDEX_LENGTH | bigint(21) | NO | | 0 | |
| DATA_FREE | bigint(21) | NO | | 0 | |
| CREATE_TIME | datetime | YES | | NULL | |
| UPDATE_TIME | datetime | YES | | NULL | |
| CHECK_TIME | datetime | YES | | NULL | |
| CHECKSUM | bigint(21) | YES | | NULL | |
| PARTITION_COMMENT | varchar(80) | NO | | | |
| NODEGROUP | varchar(12) | NO | | | |
| TABLESPACE_NAME | varchar(64) | YES | | NULL | |
+-------------------------------+--------------+------+-----+---------+-------+
Table partitioning
Retrieving partition information
SELECT partition_name
FROM information_schema.partitions
WHERE table_name = 'sales'
;
+----------------+
| partition_name |
+----------------+
| P2000 |
| P2001 |
| P2002 |
| P2003 |
| P2004 |
| P2005 |
| P2006 |
| P2007 |
| P2008 |
| P2009 |
| P2010 |
| P2011 |
+----------------+
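The same view also exposes per-partition size information; for instance, a quick check of row counts per partition:

SELECT partition_name, table_rows
FROM information_schema.partitions
WHERE table_name = 'sales'
ORDER BY partition_ordinal_position;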
Table partitioning

Use-case: Flemish Traffic Center
Medium sized data set: 514M rows
Partitioned by week
 PARTITION BY RANGE ( YEAR(mtime)*100 + WEEK(mtime) )

80GB of data in hundreds of partitions
1 row of data: speed, number of vehicles (5 classes), etc.
Per minute
Per lane of traffic
For 1550 detectors
 815M rows of data per year

A test query aggregating speed & counts per road position per minute
Gave back 642,319 rows
9 seconds for MySQL to get the result
29 seconds to pass the data to the client over JDBC
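The week-based range above folds year and week into one integer (e.g. week 5 of 2007 becomes 200705). A sketch of such a table, with hypothetical table and column names based on the use-case description:

CREATE TABLE traffic
(
  mtime TIMESTAMP,        -- measurement minute
  detector_id INT,        -- one of the 1550 detectors
  lane TINYINT,
  speed SMALLINT,
  vehicle_count SMALLINT
)
PARTITION BY RANGE ( YEAR(mtime)*100 + WEEK(mtime) )
(
  PARTITION P200701 VALUES LESS THAN (200702),
  PARTITION P200702 VALUES LESS THAN (200703)
);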
Table partitioning

Should we do table partitioning? Yes!
However…
Even though you can do sub-partitioning, you have to partition the
data in one particular way: by range, by list, …
The entry point for optimization is always going to be that
partitioning method
Pruning is always going to be based on that partitioning method
Queries across the whole data set don’t always get a benefit (but
pay no penalty either)

More detailed information:
Partitioning in MySQL 5.1 and Onwards
Mikael Ronström, Senior Software Architect, MySQL AB
Tomorrow 11:50am – 12:35pm, Ballroom A

How can we improve upon this?
Database partitioning
Challenge:
Dealing with even greater data sets
The need to present results even quicker
The old solution
Large boxes
Huge amounts of RAM
Very fast disk systems (Fibre attached, SAN, …)
Multi-CPU monsters
 $$$ Very expensive and complicated systems $$$
The new solution:
2 or more MySQL servers
Basic, off the shelf, hardware
Standard software
Simple configurations
Pentaho Data Integration & MySQL 5.1
Database partitioning

[Diagram: the Sales table, partitioned by year (2003–2006), is spread
over four databases DB1–DB4, one Sales table per database]
Database partitioning

Split the data over several databases
Pick a splitting mechanism in advance
Divide the load over the databases
Make the databases smaller
Increase the query performance
Increase the load performance
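One simple splitting mechanism, sketched here under the assumption that rows carry a numeric detector_id (as in the traffic use case later on) and that there are 3 target databases, is a modulo rule; table and column names are illustrative:

-- hypothetical routing rule: split rows per detector over 3 databases
-- MOD(detector_id, 3) = 0 → DB1, 1 → DB2, 2 → DB3
SELECT detector_id, MOD(detector_id, 3) AS target_db
FROM traffic;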
Database partitioning

How do you set it up?
First create a partitioned / clustered database connection

Then create a Partitioning Schema
Database partitioning

Demo time:
Creating the partitions
Loading data, partitioned
Reading back data, partitioned
Reading back data, partitioned and ordered
Database partitioning

Use-case: Flemish Traffic Center
Medium sized data set: 514M rows
Same partitioning as before
Same table layout, engine and configuration
Partitioned over 3 databases on detector (per lane)

A test query aggregating speed & counts per road position per minute
Gave back 642,319 rows
3 seconds for MySQL to get the result
10 seconds to pass the data to the client over JDBC
 Demonstrating almost linear scalability!!
Database partitioning

Should we do database partitioning? Yes!
However…
The scalability is limited to the raw CPU power that is available on a
single system.
This poses a scalability problem when
dealing with larger numbers of partitions
reading back large volumes of data

 Enter Pentaho Data Integration: Clustering
Pentaho Data Integration: Clustering

[Diagram: a single server X reads the Sales data (2003–2006) from
database partition DBx; the CPU of that one server becomes the
bottleneck]
Pentaho Data Integration: Clustering

The challenge is to eliminate CPU bottlenecks
Pentaho Data Integration “Carte” comes to the rescue
Can be started up on any system turning it into a “Slave Server”
The combination of database partitioning and PDI Clustering:
Will split up the number of database partitions among the slave
servers
Will split the total CPU load over the slave servers
Will split the total I/O load over the database partitions (MySQL
instances)
Pentaho Data Integration: Clustering

Demo-time
Start up 2 slave servers
Run a step across the 2 servers
Monitor
Pentaho Metadata to the rescue

The problem:
Reporting becomes harder on a database partitioned system
Instead of 1 database you read from a bunch of them

This is where our new project, Pentaho Metadata, comes in
Due in a few months
Auto-generates SQL now
Will auto-generate PDI transformations in the near future
Will auto-deliver the data to your report in the platform
Will make it very easy for your users to create drag and drop
reports
Pentaho Introduction
World’s most popular enterprise open source
BI Suite
2 million lifetime downloads, averaging 100K /
month
Founded in 2004: Pioneer in professional open
source BI

Key Projects
JFreeReport: Reporting
Kettle: Data Integration
Mondrian: OLAP
Pentaho BI Platform
Weka: Data Mining

Management and Board
Proven BI veterans from Business Objects,
Cognos, Hyperion, SAS, Oracle
Open source leaders - Larry Augustin, New
Enterprise Associates, Index Ventures

MySQL Gold Partner
Pentaho Introduction
[Diagram: Pentaho BI pyramid, from Operational sources (Sales,
Marketing, Inventory, Production, Financial) through Departmental
Reports, Aggregates and Analysis up to Strategic Scorecards]
Pentaho Introduction
Demo and overview
Questions and Closing

?
Birds of a Feather

MySQL Data Warehousing and BI
Tonight at 7:30
Camino Real
With:
Chief Geek James Dixon: Senior architect & CTO, Pentaho
Julian Hyde, OLAP Architect and Mondrian Project Founder, Pentaho
Brian Miezejewski, Principal Consultant, MySQL
Roland Bouman, Certification Developer, MySQL
MySELF™
A few bottles of beer