
Exploiting MySQL 5.1
For Advanced Business Intelligence Applications
Matt Casters: Chief Architect, Data Integration and
Kettle Project Founder

MySQL User Conference, Tuesday April 24, 2007
Agenda
Great news
Pentaho Data Integration introduction
Pentaho Data Integration MySQL support
Table partitioning
Database partitioning
Clustering
Metadata to the rescue
Pentaho company overview & platform demo
Questions & closing
Great news

The San Francisco Giants beat the Arizona Diamondbacks last
Saturday
Barry Zito pitched a great game
The hot-dogs were great
Pentaho news

Mondrian 2.3 was released a few weeks ago
Detailed cache control (clearing)

Sun Microsystems & Pentaho announced that Pentaho Reporting
will be used to drive Business Intelligence requirements in Open
Office 2.3: to be released in the second half of this year.
Tomorrow we’ll make another announcement.
Pentaho Data Integration introduction

From source systems …
to the data warehouse …
to reports …
to analyses …
to dashboard reports …
to better information
Data Transformation and Integration
Examples
Data filtering
Is not null, greater than, less than, includes
Field manipulation
Trimming, padding, upper and lowercase conversion
Data calculations
+, −, ×, /, average, absolute value, arctangent, natural logarithm
Date manipulation
First day of month, Last day of month, add months, week of year,
day of year
Data type conversion
String to number, number to string, date to number
Merging fields & splitting fields
Looking up data
Look up in a database, in a text file, an Excel sheet, …
Pentaho Data Integration (Kettle)
Components
Spoon
Connect to data sources
Define transformation rules and design target schema(s)
Graphical job execution workflow engine for defining multi-stage and
conditional transformation jobs
Pan
Command-line execution of single, pre-defined transformation jobs
Kitchen
Scheduler for multi-stage jobs
Carte
Remote execution “slave” server with embedded web server
Pentaho BI Platform
Integrated scheduling of transformations or jobs
Ability to call real-time transformations and use output in reports
and dashboards
Data Transformation and Integration
Examples
Demo time: load 1M rows into a database table
Case Study: Pentaho Data Integration

Organization: Flemish Government Traffic Centre
Use case: Monitoring the state of the road network
Application requirement: Integrate minute-by-minute data from
570 highway locations for analysis (1550 lanes)
Technical challenges: Large volume of data, more than 2.5
billion rows
Business Usage: Users can now compare traffic speeds based on
weather conditions, time of day, date, season
Best practices:
Clearly understand business user requirements first
There are often multiple ways to solve data integration problems, so
consider the long-term need when choosing the right way
Summary and Resources
Pentaho and MySQL can help you manage your data
infrastructure
Extraction, Transformation and Loading for Data Warehousing and Data
Migration

kettle.pentaho.org
Kettle project homepage

kettle.javaforge.com
Kettle community website: forum, source, documentation, tech tips, samples,

www.pentaho.org/download/
All Pentaho modules, pre-configured with sample data
Developer forums, documentation
Ventana Research Open Source BI Survey

www.mysql.com
White paper -
http://dev.mysql.com/tech-resources/articles/mysql_5.0_pentaho.html
Kettle Webinar -
http://www.mysql.com/news-and-events/on-demand-webinars/pentaho-2006-09-1
Pentaho Data Integration MySQL support

Reading: MySQL Result Streaming (cursor emulation) support
Writing: MySQL dialects for data types
Job entry: Bulk Loader of text files for MySQL
Job entry: Bulk writer to a text file for MySQL
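The two bulk job entries above map onto MySQL’s native bulk statements. A minimal sketch, assuming a hypothetical tab-separated file /tmp/sales.txt and the sales table used later in this talk:

-- bulk-load a text file into a table
LOAD DATA LOCAL INFILE '/tmp/sales.txt'
INTO TABLE sales
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';

-- bulk-write a table out to a text file (file path is illustrative)
SELECT * INTO OUTFILE '/tmp/sales_out.txt'
FIELDS TERMINATED BY '\t'
FROM sales;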
Table partitioning

Challenges
More data is being gathered all the time
Data is coming from more sources than ever
Faster access to stored information is becoming more important
More people require concurrent access to the data

The old solution
Split up the data into different tables
SALES_2003
SALES_2004

SALES_2006
Hardcode reports to use one or the other table
Lots of database management issues
High report maintenance costs
Nothing is automatic, everything is manual
Table partitioning

The new solution:
MySQL 5.1 partitioned tables
Automatic split up of data into smaller chunks
SQL engine that can automatically pull results back together
Table partitioning

[Diagram: one logical Sales table, split into Year 2003, Year 2004,
Year 2005 and Year 2006 partitions]
Table partitioning

Advantages
Reduces query time by reducing the amount of data to “plough”
through.
Increases performance by
“Pruning” the list of partitions to search for automatically. This is
done using the MySQL query optimizer based on the query that’s
being issued.
Massive reduction in I/O
Smaller partitioned indexes leading to faster index tree traversal
Allowing parallel access to the different partitions
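Pruning can be checked with EXPLAIN PARTITIONS (available in MySQL 5.1). A sketch against the sales table defined on the next slide:

EXPLAIN PARTITIONS
SELECT COUNT(*)
FROM sales
WHERE sales_date < '2001-01-01';

-- the "partitions" column of the plan should list only P2000,
-- showing the optimizer skipped the other partitions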
Table partitioning

How do I create a partitioned table?

CREATE TABLE sales
(
sales_date TIMESTAMP,
ordernr INT,
customernr INT,
productnr INT,
discount_pct TINYINT
)
PARTITION BY RANGE( YEAR(sales_date) )
(
PARTITION P2000 VALUES LESS THAN (2001),
PARTITION P2001 VALUES LESS THAN (2002)
)
Table partitioning

How do I add another partition to the table?

ALTER TABLE sales
ADD PARTITION
(
PARTITION P2002 VALUES LESS THAN (2003)
)
Table partitioning

How do I drop a partition from the table?

ALTER TABLE sales
DROP PARTITION P2002
;
Table partitioning
Retrieving partition information
mysql> desc information_schema.partitions;
+-------------------------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------------------------+--------------+------+-----+---------+-------+
| TABLE_CATALOG | varchar(512) | YES | | NULL | |
| TABLE_SCHEMA | varchar(64) | NO | | | |
| TABLE_NAME | varchar(64) | NO | | | |
| PARTITION_NAME | varchar(64) | YES | | NULL | |
| SUBPARTITION_NAME | varchar(64) | YES | | NULL | |
| PARTITION_ORDINAL_POSITION | bigint(21) | YES | | NULL | |
| SUBPARTITION_ORDINAL_POSITION | bigint(21) | YES | | NULL | |
| PARTITION_METHOD | varchar(12) | YES | | NULL | |
| SUBPARTITION_METHOD | varchar(12) | YES | | NULL | |
| PARTITION_EXPRESSION | longtext | YES | | NULL | |
| SUBPARTITION_EXPRESSION | longtext | YES | | NULL | |
| PARTITION_DESCRIPTION | longtext | YES | | NULL | |
| TABLE_ROWS | bigint(21) | NO | | 0 | |
| AVG_ROW_LENGTH | bigint(21) | NO | | 0 | |
| DATA_LENGTH | bigint(21) | NO | | 0 | |
| MAX_DATA_LENGTH | bigint(21) | YES | | NULL | |
| INDEX_LENGTH | bigint(21) | NO | | 0 | |
| DATA_FREE | bigint(21) | NO | | 0 | |
| CREATE_TIME | datetime | YES | | NULL | |
| UPDATE_TIME | datetime | YES | | NULL | |
| CHECK_TIME | datetime | YES | | NULL | |
| CHECKSUM | bigint(21) | YES | | NULL | |
| PARTITION_COMMENT | varchar(80) | NO | | | |
| NODEGROUP | varchar(12) | NO | | | |
| TABLESPACE_NAME | varchar(64) | YES | | NULL | |
+-------------------------------+--------------+------+-----+---------+-------+
Table partitioning
Retrieving partition information
SELECT partition_name
FROM information_schema.partitions
WHERE table_name = 'sales'
;
+----------------+
| partition_name |
+----------------+
| P2000 |
| P2001 |
| P2002 |
| P2003 |
| P2004 |
| P2005 |
| P2006 |
| P2007 |
| P2008 |
| P2009 |
| P2010 |
| P2011 |
+----------------+
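The same view also exposes per-partition size information; for instance, a quick check of row counts per partition:

SELECT partition_name, table_rows
FROM information_schema.partitions
WHERE table_name = 'sales'
ORDER BY partition_ordinal_position;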
Table partitioning

Use-case: Flemish Traffic Center
Medium sized data set: 514M rows
Partitioned by week
 PARTITION BY RANGE ( YEAR(mtime)*100 + WEEK(mtime) )

80GB of data in hundreds of partitions
1 row of data: speed, number of vehicles (5 classes), etc.
Per minute
Per lane of traffic
For 1550 detectors
 815M rows of data per year

A test query aggregating speed & counts per road position per minute
Gave back 642,319 rows
9 seconds for MySQL to get the result
29 seconds to pass the data to the client over JDBC
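The week-based range above folds year and week into one integer (e.g. week 5 of 2007 becomes 200705). A sketch of such a table, with hypothetical table and column names based on the use-case description:

CREATE TABLE traffic
(
  mtime TIMESTAMP,        -- measurement minute
  detector_id INT,        -- one of the 1550 detectors
  lane TINYINT,
  speed SMALLINT,
  vehicle_count SMALLINT
)
PARTITION BY RANGE ( YEAR(mtime)*100 + WEEK(mtime) )
(
  PARTITION P200701 VALUES LESS THAN (200702),
  PARTITION P200702 VALUES LESS THAN (200703)
);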
Table partitioning

Should we do table partitioning? Yes!
However…
Even though you can do sub-partitioning, you have to partition the
data in one particular way: by range, by list, …
The entry point for optimization is always going to be that
partitioning method
Pruning is always going to be based on that partitioning method
Queries across the whole data set don’t always get a benefit (but
pay no penalty either)

More detailed information:
Partitioning in MySQL 5.1 and Onwards
Mikael Ronström, Senior Software Architect, MySQL AB
Tomorrow 11:50am – 12:35pm, Ballroom A

How can we improve upon this?
Database partitioning
Challenge:
Dealing with even greater data sets
The need to present results even quicker
The old solution
Large boxes
Huge amounts of RAM
Very fast disk systems (Fibre attached, SAN, …)
Multi-CPU monsters
 $$$ Very expensive and complicated systems $$$
The new solution:
2 or more MySQL servers
Basic, off the shelf, hardware
Standard software
Simple configurations
Pentaho Data Integration & MySQL 5.1
Database partitioning

[Diagram: the Sales table, partitioned by year (2003–2006), is spread
over four databases DB1–DB4, one Sales table per database]
Database partitioning

Split the data over several databases
Pick a splitting mechanism in advance
Divide the load over the databases
Make the databases smaller
Increase the query performance
Increase the load performance
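One simple splitting mechanism, sketched here under the assumption that rows carry a numeric detector_id (as in the traffic use case later on) and that there are 3 target databases, is a modulo rule; table and column names are illustrative:

-- hypothetical routing rule: split rows per detector over 3 databases
-- MOD(detector_id, 3) = 0 → DB1, 1 → DB2, 2 → DB3
SELECT detector_id, MOD(detector_id, 3) AS target_db
FROM traffic;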
Database partitioning

How do you set it up?
First create a partitioned / clustered database connection

Then create a Partitioning Schema
Database partitioning

Demo time:
Creating the partitions
Loading data, partitioned
Reading back data, partitioned
Reading back data, partitioned and ordered
Database partitioning

Use-case: Flemish Traffic Center
Medium sized data set: 514M rows
Same partitioning as before
Same table layout, engine and configuration
Partitioned over 3 databases on detector (per lane)

A test query aggregating speed & counts per road position per minute
Gave back 642,319 rows
3 seconds for MySQL to get the result
10 seconds to pass the data to the client over JDBC
 Demonstrating almost linear scalability!!
Database partitioning

Should we do database partitioning? Yes!
However…
The scalability is limited to the raw CPU power that is available on a
single system.
This poses a scalability problem when
dealing with larger numbers of partitions
reading back large volumes of data

 Enter Pentaho Data Integration: Clustering
Pentaho Data Integration: Clustering

[Diagram: a single server X reads the Sales data (2003–2006) from
database partition DBx; the CPU of that one server becomes the
bottleneck]
Pentaho Data Integration: Clustering

The challenge is to eliminate CPU bottlenecks
Pentaho Data Integration “Carte” comes to the rescue
Can be started up on any system turning it into a “Slave Server”
The combination of database partitioning and PDI Clustering:
Will split up the number of database partitions among the slave
servers
Will split the total CPU load over the slave servers
Will split the total I/O load over the database partitions (MySQL
instances)
Pentaho Data Integration: Clustering

Demo-time
Start up 2 slave servers
Run a step across the 2 servers
Monitor
Pentaho Metadata to the rescue

The problem:
Reporting becomes harder on a database partitioned system
Instead of 1 database you read from a bunch of them

This is where our new project, Pentaho Metadata, comes in
Due in a few months
Auto-generates SQL now
Will auto-generate PDI transformations in the near future
Will auto-deliver the data to your report in the platform
Will make it very easy for your users to create drag and drop
reports
Pentaho Introduction
World’s most popular enterprise open source
BI Suite
2 million lifetime downloads, averaging 100K /
month
Founded in 2004: Pioneer in professional open
source BI

Key Projects
JFreeReport: Reporting
Kettle: Data Integration
Mondrian: OLAP
Pentaho BI Platform
Weka: Data Mining

Management and Board
Proven BI veterans from Business Objects,
Cognos, Hyperion, SAS, Oracle
Open source leaders - Larry Augustin, New
Enterprise Associates, Index Ventures

MySQL Gold Partner
Pentaho Introduction
[Diagram: Pentaho BI pyramid, from Operational sources (Sales,
Marketing, Inventory, Production, Financial) through Departmental
Reports, Aggregates and Analysis up to Strategic Scorecards]
Pentaho Introduction
Demo and overview
Questions and Closing

?
Birds of a Feather

MySQL Data Warehousing and BI
Tonight at 7:30
Camino Real
With:
Chief Geek James Dixon: Senior architect & CTO, Pentaho
Julian Hyde, OLAP Architect and Mondrian Project Founder, Pentaho
Brian Miezejewski, Principal Consultant, MySQL
Roland Bouman, Certification Developer, MySQL
MySELF™
A few bottles of beer