
Spring 2008 Issue 4

MySQL Magazine

In this issue:

A Tour of MySQL Certification
On Efficiently Geo-Referencing IPs With MaxMind GeoIP and MySQL GIS
Introducing Kickfire
Automatic Query Optimization with QOT

Book Worm
Coding Corner
News
log-bin

News, articles and feedback for MySQL database administrators and developers who use MySQL on a daily basis to provide some of the best database infrastructure available.


A Tour of MySQL Certification
By Mark Schoonover

Certified MySQL Associate (CMA) is the entry-level MySQL certification, requiring but a single test. If you have experience with other database systems, you'll find your knowledge will apply to the CMA track. If you already understand client/server, data definition language (DDL), and basic SQL with grouping, joining tables and the ability to modify data, you'll have a huge head start on studying, or you could be ready to take the exam!

The CMA will get you started towards working on the Certified MySQL Database Administrator (CMDBA) certificate. This exam is very comprehensive in what a database administrator should know about MySQL. Keep in mind, this is a starting point for a DBA, not an ending point. You should have about six months of daily experience in a moderately sized MySQL environment. Having the ability to install and use a MySQL learning environment is crucial too. There will be parts of the curriculum that will go into areas of MySQL that your environment may not require.

This certification continues on where the CMA leaves off. Not only are there more requirements, it goes deeper into the details. You'll be challenged, but if you've created a learning environment, you can try all the commands on a live system to see how things work. This is a crucial concept: some questions may ask you all the ways to do the same thing. If you know only a single way to accomplish a certain task, you'll get the question wrong. If you're responsible for the daily operation, tuning and maintenance of a MySQL server, the CMDBA certification is what you want.

Next is the Certified MySQL Developer (CMDEV). The exams aren't language specific, so if you're a programmer using Perl, PHP, Java, or connecting via ODBC to a MySQL server, this is the certification you want. You'll cover all the things a developer would need to do with MySQL server: stored procedures, data types, triggers and views, and importing and exporting data, to name a few. If your task is to develop applications that use MySQL to store and retrieve data, this certification is what you need.

Essentially part three of the CMDBA certification is the Certified MySQL Cluster Database Administrator. You must pass the CMDBA tests first before taking this exam, and you should have direct experience with MySQL Cluster in a production environment. This certification covers the NDB storage engine, performance tuning, backups and recovery, and more. Pass a single exam to obtain this certification.

About the author
Mark Schoonover lives near San Diego, California with his wife, three boys, a neurotic cat, and a retired Greyhound. He's experienced as a DBA, system administrator, network engineer and web developer. He enjoys amateur radio, running marathons, and long distance cycling. He can also be found coaching youth soccer, and getting yelled at as a referee on the weekends.

Book Worm

I am excited to have the chance to review High Performance MySQL, second edition. The authors all have extensive experience in MySQL and it shows. Every section is carefully crafted and covers important topics for an experienced DBA who is looking to take his or her skill set to the next level. Many topics are covered, including backups and restores, replication, tuning (OS and MySQL), and significant information on scaling applications using sharding. This edition is updated with a lot of new material. It is well worth a read. The book will be released in June.


Coding Corner
Take the Drudgery Out of Writing Crosstabs
By Peter Brawley

A crosstab (or pivot table) is a cross-tabulation. It displays results as they distribute across distinct values of two or more variables. Statisticians call it a contingency table.
Is there an organization in the whole world that doesn't need crosstab reports? Suppose you have a sales table listing product, salesperson and sales amount:

USE test;
DROP TABLE IF EXISTS sales;
CREATE TABLE sales (
  id int(11) default NULL,
  product char(5) default NULL,
  salesperson char(5) default NULL,
  amount decimal(10,2) default NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
INSERT INTO sales VALUES
(1,'radio','bob','100.00'),
(2,'radio','sam','100.00'),
(3,'radio','sam','100.00'),
(4,'tv','bob','200.00'),
(5,'tv','sam','300.00'),
(6,'radio','bob','100.00');

SELECT * FROM sales;
+------+---------+-------------+--------+
| id   | product | salesperson | amount |
+------+---------+-------------+--------+
|    1 | radio   | bob         | 100.00 |
|    2 | radio   | sam         | 100.00 |
|    3 | radio   | sam         | 100.00 |
|    4 | tv      | bob         | 200.00 |
|    5 | tv      | sam         | 300.00 |
|    6 | radio   | bob         | 100.00 |
+------+---------+-------------+--------+

We wish to see sales totals for each salesperson and each product. To do this we write a query which:
1. groups results by product,
2. uses a conditional sum to accumulate the sales total of each salesperson in its own column, and

3. creates a horizontal sum column:

SELECT
  product
  ,SUM( CASE salesperson WHEN 'bob' THEN amount ELSE 0 END ) AS 'Bob'
  ,SUM( CASE salesperson WHEN 'sam' THEN amount ELSE 0 END ) AS 'Sam'
  ,SUM( amount ) AS Total
FROM sales
GROUP BY product WITH ROLLUP;
+---------+--------+--------+--------+
| product | Bob    | Sam    | Total  |
+---------+--------+--------+--------+
| radio   | 200.00 | 200.00 | 400.00 |
| tv      | 200.00 | 300.00 | 500.00 |
| NULL    | 400.00 | 500.00 | 900.00 |
+---------+--------+--------+--------+

This crosstab query generates one row per product, one column per salesperson, and a total-by-product column on the far right. The pivoting CASE expressions assign sales.amount values to matching salespersons' columns. WITH ROLLUP adds a row at the bottom for salesperson sums.
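For readers who want to check the arithmetic, the same cross-tabulation can be sketched in a few lines of Python over the six sample rows. This is our own re-implementation of the conditional sums, not part of the article's SQL:

```python
# Recompute the crosstab in plain Python, using the six sample
# rows from the sales table above.
from collections import defaultdict

sales = [
    (1, 'radio', 'bob', 100.00),
    (2, 'radio', 'sam', 100.00),
    (3, 'radio', 'sam', 100.00),
    (4, 'tv',    'bob', 200.00),
    (5, 'tv',    'sam', 300.00),
    (6, 'radio', 'bob', 100.00),
]

# crosstab[product][salesperson] accumulates amounts, mirroring
# SUM(CASE salesperson WHEN ... THEN amount ELSE 0 END).
crosstab = defaultdict(lambda: defaultdict(float))
for _id, product, person, amount in sales:
    crosstab[product][person] += amount
    crosstab[product]['Total'] += amount  # the horizontal sum column

print(crosstab['radio']['bob'])    # 200.0
print(crosstab['radio']['Total'])  # 400.0
```

The nested dictionary plays the role of the result grid; the SQL version is of course what you would run in production.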



This is a snap for two products and two salespersons. When there are dozens of salespersons, writing the query becomes tiresome and error-prone. It is time to invoke the programmer's fundamental rule: what you always have to do, ensure that you never have to do.

Some years ago Giuseppe Maxia published a little query that automates writing these crosstab expressions. His idea was to embed the syntax for lines like the SUM( CASE ...) lines above in a query for the DISTINCT values. At the time Giuseppe was writing, MySQL did not support stored procedures. Now that it does, we can further generalize Giuseppe's idea by parameterizing it in a stored procedure.

Admittedly it is a little daunting. To write a query with variable names rather than the usual literal table and column names we have to write PREPARE statements. Now we propose to write SQL that writes PREPARE statements. Code, which writes code, which writes code. Not a job for the back of a napkin.

It is easy enough to write the stored procedure shell. To make sure that all our general-purpose stored routines are in one predictable place and available to all databases on a server, we keep generic routines in a sys database, so our crosstab drudge killer needs parameters specifying database, table, pivot column and (in some cases) the aggregating column.
What do we do next? What worked for us was to proceed from back to front:
1. Write the pivot expressions for specific cases.
2. Write the PREPARE statement that generates those expressions.
3. Parameterize the result of number two.
4. Encapsulate the result of number three in a stored procedure.

“It’s time to invoke the programmer's fundamental rule: what you always have to do, ensure that you never have to do.”

To complicate matters a bit more, we soon found that different summary aggregations, for example COUNT and SUM, require different code. Here is the routine for generating COUNT pivot expressions:

USE sys;
DROP PROCEDURE IF EXISTS writecountpivot;
DELIMITER |
CREATE PROCEDURE writecountpivot( db CHAR(64), tbl CHAR(64), col CHAR(64) )
BEGIN
  DECLARE datadelim CHAR(1) DEFAULT '"';
  DECLARE singlequote CHAR(1) DEFAULT CHAR(39);
  DECLARE comma CHAR(1) DEFAULT ',';
  SET @sqlmode = (SELECT @@sql_mode);
  SET @@sql_mode='';
  SET @sql = CONCAT(
    'SELECT DISTINCT CONCAT(', singlequote, ',SUM(IF(', col, ' = ', datadelim,
    singlequote, comma, col, comma, singlequote, datadelim, comma,
    '1,0)) AS `', singlequote, comma, col, comma, singlequote, '`',
    singlequote, ') AS countpivotarg FROM ', db, '.', tbl,
    ' WHERE ', col, ' IS NOT NULL'
  );
  -- UNCOMMENT TO SEE THE MIDLEVEL CODE:
  -- SELECT @sql;
  PREPARE stmt FROM @sql;
  EXECUTE stmt;
  DROP PREPARE stmt;
  SET @@sql_mode=@sqlmode;
END;
|
DELIMITER ;

CALL sys.writecountpivot('test','sales','salesperson');


For our little sales table, this generates ...

SELECT DISTINCT
  CONCAT(',SUM(IF(salesperson = "',salesperson,'",1,0)) AS `',salesperson,'`')
  AS countpivotarg
FROM test.sales
WHERE salesperson IS NOT NULL

and returns ...

+--------------------------------------------+
| countpivotarg                              |
+--------------------------------------------+
| ,SUM(IF(salesperson = "bob",1,0)) AS `bob` |
| ,SUM(IF(salesperson = "sam",1,0)) AS `sam` |
+--------------------------------------------+

which we plug into ...

SELECT
  product
  ,SUM(IF(salesperson = "bob",1,0)) AS `bob`
  ,SUM(IF(salesperson = "sam",1,0)) AS `sam`
  ,COUNT(*) AS Total
FROM test.sales
GROUP BY product WITH ROLLUP;
+---------+------+------+-------+
| product | bob  | sam  | Total |
+---------+------+------+-------+
| radio   |    2 |    2 |     4 |
| tv      |    1 |    1 |     2 |
| NULL    |    3 |    3 |     6 |
+---------+------+------+-------+

Not overwhelming for two columns. Very convenient indeed if there are fifty.

A point to notice is that the two levels of code generation create quote mark nesting problems. To make the double quote mark '"' available for data value delimiting, we make sure that sql_mode ANSI_QUOTES is not set during code generation.
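The string that the stored procedure assembles is easier to see in a procedural language. Here is a rough Python sketch of the expression list writecountpivot emits, given the distinct pivot values; the function name and hard-coded value list are ours, for illustration only:

```python
# Emit one ",SUM(IF(...)) AS `...`" line per distinct pivot value,
# mirroring the output of the writecountpivot stored procedure.
def count_pivot_args(col, values):
    return [',SUM(IF(%s = "%s",1,0)) AS `%s`' % (col, v, v)
            for v in values]

# Distinct salespersons would normally come from
# SELECT DISTINCT salesperson FROM sales.
args = count_pivot_args('salesperson', ['bob', 'sam'])
print(args[0])  # ,SUM(IF(salesperson = "bob",1,0)) AS `bob`
```

The generated lines are then spliced between a SELECT clause and the GROUP BY, exactly as shown in the report query above.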

What is going on when SHOW PROCESSLIST returns many queries in the "Sleep" status? Sleep rows appear when a thread is connected but not doing anything, and they do not actually pose a problem by themselves. If an application has a connection pool, sleeping connections are expected. If an application does not use connection pooling, it is possible that database connections and handlers are not being closed properly, so the MySQL server thinks the thread is still connected but not sending any queries. That can be problematic: MySQL allocates some per-thread memory, so many unnecessary threads can eat up memory. This might be caused by scripts timing out on the application side before they can close the db connection properly, or simply never closing the connection. The "interactive_timeout" parameter can be set lower, which allows less time for connections to "Sleep". The default is usually 1 day. Note that this applies to any login to the database, so if the timeout is set to 10 minutes, command-line connections will also be closed after 10 minutes of being idle. Basically, you do not want to set it too short.
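One defensive pattern on the application side is to make connection close-out deterministic rather than relying on server timeouts. A minimal Python sketch, using sqlite3 purely as a stand-in for a MySQL driver:

```python
# Close the connection deterministically so no idle "Sleep"
# session is left behind on the server, even if the query raises.
import sqlite3
from contextlib import closing

def fetch_scalar(sql):
    # contextlib.closing() guarantees conn.close() runs on exit.
    with closing(sqlite3.connect(':memory:')) as conn:
        return conn.execute(sql).fetchone()[0]

print(fetch_scalar('SELECT 1 + 1'))  # 2
```

With a real MySQL driver the same shape applies; the point is that the close happens in all code paths, not only the happy one.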


SUM pivot queries need a different syntax:

USE sys;
DROP PROCEDURE IF EXISTS writesumpivot;
DELIMITER |
CREATE PROCEDURE writesumpivot(
  db CHAR(64), tbl CHAR(64), pivotcol CHAR(64), sumcol CHAR(64)
)
BEGIN
  DECLARE datadelim CHAR(1) DEFAULT '"';
  DECLARE comma CHAR(1) DEFAULT ',';
  DECLARE singlequote CHAR(1) DEFAULT CHAR(39);
  SET @sqlmode = (SELECT @@sql_mode);
  SET @@sql_mode='';
  SET @sql = CONCAT(
    'SELECT DISTINCT CONCAT(', singlequote, ',SUM(IF(', pivotcol, ' = ', datadelim,
    singlequote, comma, pivotcol, comma, singlequote, datadelim, comma, sumcol,
    ',0)) AS `', singlequote, comma, pivotcol, comma, singlequote, '`',
    singlequote, ') AS sumpivotarg FROM ', db, '.', tbl,
    ' WHERE ', pivotcol, ' IS NOT NULL'
  );
  -- UNCOMMENT TO SEE THE MIDLEVEL SQL:
  -- SELECT @sql;
  PREPARE stmt FROM @sql;
  EXECUTE stmt;
  DROP PREPARE stmt;
  SET @@sql_mode=@sqlmode;
END;
|
DELIMITER ;

CALL writesumpivot('test','sales','salesperson','amount');
+-------------------------------------------------+
| sumpivotarg                                     |
+-------------------------------------------------+
| ,SUM(IF(salesperson = "bob",amount,0)) AS `bob` |
| ,SUM(IF(salesperson = "sam",amount,0)) AS `sam` |
+-------------------------------------------------+

which we insert in the report query ...

SELECT
  product
  ,SUM(IF(salesperson = "bob",amount,0)) AS `bob`
  ,SUM(IF(salesperson = "sam",amount,0)) AS `sam`
  ,SUM(amount) AS Total
FROM test.sales
GROUP BY product;
+---------+--------+--------+--------+
| product | bob    | sam    | Total  |
+---------+--------+--------+--------+
| radio   | 200.00 | 200.00 | 400.00 |
| tv      | 200.00 | 300.00 | 500.00 |
+---------+--------+--------+--------+

Other aggregating functions also require somewhat different code. Maybe you are the reader who will be inspired to write the completely general case? While you're at it, consider including logic to write the rest of the crosstab query, too.

About the author
Peter Brawley is president of Artful Software Development and co-author of Getting It Done With MySQL 5.


On Efficiently Geo-Referencing IPs With MaxMind GeoIP and MySQL GIS
By Jeremy Cole

Geo-referencing IPs is, in a nutshell, converting an IP address into the name of some entity owning that IP address. There are a number of reasons you might want to geo-reference IP addresses to country, city, etc. Some examples might be simple ad targeting systems, geographic load balancing, web analytics, and many more applications.

This is a very common task, but I have never actually seen it done efficiently using MySQL in the wild. There is a lot of questionable advice on forums, blogs and other sites out there on this topic. After working with a Proven Scaling customer, I recently did some thinking and some performance testing on this problem, so I thought I would publish some hard data and advice for everyone.

Unfortunately, R-tree (spatial) indexes have not been added to InnoDB yet, so the tricks in this entry only work efficiently with MyISAM tables. They should work with InnoDB but they will perform poorly. This is actually OK for the most part, as the geo-referencing functionality most people need doesn't really need transactional support, and since the data tables are basically read-only (there are monthly published updates), the likelihood of corruption in MyISAM due to any server failures isn't very high.

The Data Provided By MaxMind

MaxMind is a great company that produces several geo-referencing databases. They release both a commercial (for-pay, but affordable) product called GeoIP, and a free version of the same databases, called GeoLite. The most popular of their databases that I have seen used is GeoLite Country. This allows you to look up nearly any IP and find out which country (hopefully) its user resides in. The free GeoLite versions are normally good enough, at about 98% accurate, but the for-pay GeoIP versions in theory are more accurate. In this article I will refer to both GeoIP and GeoLite as "GeoIP" for simplicity.

GeoIP Country is available as a CSV file containing the following fields:

ip from, ip to (text) — The start and end IP addresses as text, e.g. "". This is a handy way for a human to read an IP address, but a very inefficient way for a computer to handle IP addresses.

ip from, ip to (integer) — The same start and end IP addresses as 32-bit integers1, e.g. 50331648.

country code — The 2-letter ISO country code for the country to which this IP address has been assigned, or in some cases a special string, e.g. one meaning "Satellite Provider".

country name — The full country name of the same. This is redundant with the country code if you have a lookup table of country codes (including MaxMind's non-ISO codes), or if you make one from the GeoIP data.

MySQL, MySQL logos and the Sakila dolphin are registered trademarks of MySQL AB in the United States, the European Union and other countries. MySQL Magazine is not affiliated with MySQL AB, and the content is not endorsed, reviewed, approved nor controlled by MySQL AB.


A Simple Way to Search For an IP

Once the data has been loaded into MySQL (which will be explained in depth later), there will be a table with a range (a lower and upper bound), and some metadata about that range. For example, one row from the GeoIP data (without the redundant columns) looks like:

ip_from   ip_to     country_code
50331648  68257567  US

The natural thing that would come to mind (and in fact the solution offered by MaxMind themselves2) is BETWEEN. A simple query to search for the IP would be:

SELECT country_code
FROM ip_country
WHERE INET_ATON("") BETWEEN ip_from AND ip_to;

Unfortunately, while simple and natural, this construct is extremely inefficient, and can't effectively use indexes (although it can use them, it isn't efficient). The reason for this is that it's an open-ended range, and it is impossible to close the range by adding anything to the query. In fact I haven't been able to meaningfully improve on the performance at all.

A Much Better Solution

While it probably isn't the first thing that would come to mind, MySQL's GIS support is actually perfect for this task. Geo-referencing an IP address to a country boils down to "find which range or ranges this item belongs to", and this can be done quite efficiently using spatial R-tree indexes in MySQL's GIS implementation.

The way this works is that each IP range of (ip_from, ip_to) is represented as a rectangular polygon from (ip_from, -1) to (ip_to, +1) as illustrated on the next page:
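The rectangle encoding is easy to prototype outside SQL. A small Python sketch (the helper names are ours) that builds the corresponding 5-point WKT polygon for a range, and shows why a point (ip, 0) falls inside exactly when the IP is in range:

```python
# Build the 5-point WKT polygon used to encode an (ip_from, ip_to)
# range, with the same vertex order as the article: clockwise,
# four corners, then back to the start.
def ip_range_polygon(ip_from, ip_to):
    pts = [(ip_from, -1), (ip_to, -1), (ip_to, 1),
           (ip_from, 1), (ip_from, -1)]
    return 'POLYGON((%s))' % ','.join('%d %d' % p for p in pts)

def rectangle_contains(ip_from, ip_to, ip):
    # The point (ip, 0) lies inside the rectangle spanning
    # y in [-1, 1] iff ip_from <= ip <= ip_to.
    return ip_from <= ip <= ip_to

print(ip_range_polygon(50331648, 68257567))
```

The y extent of the rectangle is arbitrary; it only exists so the one-dimensional IP range becomes a two-dimensional shape the R-tree index can work with.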

In SQL/GIS terms, each IP range is represented by a 5-point rectangular POLYGON like this one, representing the IP range of –

POLYGON((
  50331648 -1,
  68257567 -1,
  68257567  1,
  50331648  1,
  50331648 -1
))

The search IP address can be represented as a point of (ip, 0), and that point will have a relationship with at least one of the polygons (provided it's a valid IP and part of the GeoIP database) as illustrated here:

It is then possible to search these polygons for a specific point representing an IP address using the GIS spatial relationship function MBRCONTAINS and POINT3 to search for "which polygon contains this point" like this:

SELECT country_code
FROM ip_country
WHERE MBRCONTAINS(ip_poly, POINTFROMWKB(POINT(INET_ATON(''), 0)));

Pretty cool huh? I will show how to load the data and get started, then take a look at how it performs in the real world, and compare the raw numbers between the two methods.

Loading Data and Preparing For Work

First, a table must be created to hold the data. A POLYGON field will be used to store the IP range. Technically, at this point the ip_from and ip_to fields are unnecessary, but given the complexity of extracting the IPs from the POLYGON field using MySQL functions, they will be kept anyway. This schema can be used to hold the data4:

CREATE TABLE ip_country (
  id INT UNSIGNED NOT NULL auto_increment,
  ip_poly POLYGON NOT NULL,
  ip_from INT UNSIGNED NOT NULL,
  ip_to INT UNSIGNED NOT NULL,
  country_code CHAR(2) NOT NULL,
  PRIMARY KEY (id),
  SPATIAL INDEX (ip_poly)
);

After the table has been created, the GeoIP data must be loaded into it from the CSV file downloaded from MaxMind. The LOAD DATA command can do this like so:

Have a topic for a future edition of the MySQL Magazine? Send articles and ideas to Keith Murphy, Editor

LOAD DATA LOCAL INFILE "GeoIPCountryWhois.csv"
INTO TABLE ip_country
FIELDS TERMINATED BY "," ENCLOSED BY "\""
LINES TERMINATED BY "\n"
(
  @ip_from_string, @ip_to_string,
  @ip_from, @ip_to,
  @country_code, @country_string
)
SET
  id      := NULL,
  ip_from := @ip_from,
  ip_to   := @ip_to,
  ip_poly := GEOMFROMWKB(POLYGON(LINESTRING(
    /* clockwise, 4 points and back to 0 */
    POINT(@ip_from, -1), /* 0, top left */
    POINT(@ip_to,   -1), /* 1, top right */
    POINT(@ip_to,    1), /* 2, bottom right */
    POINT(@ip_from,  1), /* 3, bottom left */
    POINT(@ip_from, -1)  /* 0, back to start */
  ))),
  country_code := @country_code
;

During the load process, the ip_from_string, ip_to_string, and country_string fields are thrown away, as they are redundant. A few GIS functions are used to build the POLYGON for ip_poly from the ip_from and ip_to fields on-the-fly. On my test machine it takes about five seconds to load the 96,641 rows in this month's CSV file.

At this point the data is loaded, and everything is ready to go to use the above SQL query to search for IPs. Try a few out to see if they seem to make sense!

There are a few interesting metrics that I tested for:
● The raw performance of a single client repeatedly querying.
● Does the number of queries handled increase as the number of clients increases?
● Is latency and overall performance adversely affected by multiple clients?

The test consisted of an IP search using the two different methods, varying the number of clients between one and sixteen in the following configurations:

Clients  Machines  Threads
1        1         1
2        1         2
4        1         4
8        2         4
16       4         4

Each test finds the country code for a random dotted-quad format IP address passed in as a string.

Performance: The test setup

In order to really test things, a bigger load testing framework will be needed, as well as a few machines to generate load. In my tests the machine being tested, kamet, is a Dell PowerEdge 2950 with Dual Dual Core Xeon 5050 @ 3.00GHz and 4 GB RAM. We have four test clients, makalu{0-3}, which are Apple Mac Minis with 1.66GHz Intel CPUs and 512MB RAM. The machines are all connected with a Netgear JGS524NA 24-port GigE switch. For the purposes of this test, the disk configuration is not important. On the software side, the server is running CentOS 4.5 with kernel 2.6.9-55.0.2.ELsmp. The Grinder 3.0b32 is used as a load generation tool with a custom Jython script5 and Connector/J 5.1.5 to connect to MySQL 5.0.45.

How does it perform? How does it compare?

If you tried the BETWEEN version of this query, you may have noticed that, in terms of human time, it doesn't take very long anyway: I pretty consistently got one row in set (0.00 sec). But don't let that fool you. There are a few metrics for determining the performance of these searches.
It’s clear that GIS wins hands down.


First a look at raw performance in terms of queries per second. Using BETWEEN, we max out at 264 q/s with sixteen clients:

Using MBRCONTAINS, we max out at 17600 q/s with sixteen clients, and it appears that it’s the test clients that are maxed out, not the server:

Next, a look at latency of the individual responses. Using BETWEEN, we start out with a single client at 15.5 ms per request, which is not very good, but still imperceptible to a human. But with sixteen clients, the latency has jumped to 60 ms, which is longer than many web shops allocate to completely construct a response. As the number of test clients increases, the latency gets much worse, because the query is so dependent on CPU:


Using MBRCONTAINS, we start out with a single client at 0.333 ms per request, and even with sixteen clients, we are well under 1 ms at 0.743 ms:

Performance is fantastic, and it’s relatively easy to use. Even if you are an all-InnoDB shop, as most of our customers are (and we would recommend), it may very well be worth it to use MyISAM specifically for this purpose. Definitely consider using MySQL GIS whenever you need to search for a point within a set of ranges.

Update 1: Another way to do it, and a look at performance
Andy Skelton and Nikolay Bachiyski left a comment below suggesting another way this could be done:

SELECT country_code
FROM ip_country
WHERE ip_to >= INET_ATON('%s')
ORDER BY ip_to ASC
LIMIT 1


This version of the query doesn't act exactly the same as the other two — if your search IP is not part of any range, it will return the next highest range. You will have to check whether ip_from is <= your IP within your own code. It may be possible to do this in MySQL directly, but I haven't found a way that doesn't kill the performance. Andy's version actually performs quite well — slightly faster and more scalable than MBRCONTAINS. I added two new performance testing configurations to better show the differences between the two:
Clients  Machines  Threads
32       4         8
64       4         16

Here’s a performance comparison of MBRCONTAINS vs. Andy’s Method: Latency (ms) — Lower is better:

Queries per second — Higher is better:

Once I get some more time to dig into this, I will look at why exactly BETWEEN is so slow. I’ve also run into an interesting possible bug in MySQL: If you add a LIMIT 1 to the BETWEEN version of the query, performance goes completely to hell. Huh? Thanks for the feedback, Andy and Nikolay.
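Andy's ORDER BY ip_to LIMIT 1 trick is essentially a binary search on ip_to. A Python sketch, with sample ranges invented for illustration, shows the extra ip_from check that, as noted above, you must do in your own code:

```python
# Binary-search a range table sorted by ip_to, then verify ip_from,
# mirroring the ORDER BY ip_to ASC LIMIT 1 query plus the
# application-side check. The sample ranges are hypothetical.
import bisect

ranges = [  # (ip_from, ip_to, country_code), sorted by ip_to
    (50331648, 68257567, 'US'),
    (68257568, 68259583, 'CA'),
]
ip_tos = [r[1] for r in ranges]

def lookup(ip):
    i = bisect.bisect_left(ip_tos, ip)          # first range with ip_to >= ip
    if i < len(ranges) and ranges[i][0] <= ip:  # the check the query omits
        return ranges[i][2]
    return None  # IP not covered by any range

print(lookup(50331649))  # US
print(lookup(1))         # None
```

Without the ip_from check, an uncovered IP would silently be attributed to the next range above it, which is exactly the pitfall described above.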

Footnotes

1. MySQL provides the INET_ATON() and INET_NTOA() functions for converting back and forth between dotted-quad strings (CHAR(15)) and 32-bit integers (INT UNSIGNED). You can also use the equivalent functions, if they exist, in your favorite programming language so that you can just feed an integer to MySQL. I haven't tested the (positive) performance implications of doing that.

2. Although, strangely, they offer a different solution specifically for MySQL using <= and >= operators instead of BETWEEN. I don't find that that difference has any effect on MySQL. Maybe it was for a really old version of MySQL that didn't have BETWEEN?

3. Pet peeve: Why does MySQL require you to pass the output of its own POLYGON, LINESTRING, POINT, etc., functions through GEOMFROMWKB in order to use them? It makes life suck that little bit more than necessary.

4. Note that if you're looking to play around with the BETWEEN version of things, you will want to add some indexes on ip_from and ip_to. I would recommend INDEX (ip_from, ip_to) and INDEX (ip_to, ip_from) as those two seemed to perform the best that I could find (given its poor efficiency to start with).

5. Custom Jython script listing:

from net.grinder.script.Grinder import grinder
from net.grinder.script import Test
from java.util import Random
from java.sql import DriverManager
from com.mysql.jdbc import Driver

DriverManager.registerDriver(Driver())

p = grinder.getProperties()
p.setLong("grinder.threads", 8)
p.setLong("grinder.runs", 100000000)
p.setLong("grinder.duration", 120 * 1000)

t = Test(1, "Query")

def getConnection():
    return DriverManager.getConnection(
        "jdbc:mysql://server/geoip", "geoip", "geoip")

class TestRunner:
    def __init__(self):
        self.connection = getConnection()

    def __call__(self):
        r = Random()
        s = self.connection.createStatement()
        q = t.wrap(s)
        ip = "%i.%i.%i.%i" % ((r.nextInt() % 256), (r.nextInt() % 256),
                              (r.nextInt() % 256), (r.nextInt() % 256))
        # Using BETWEEN
        #q.execute("select country_code from ip_country_bad where inet_aton('%s') between ip_from and ip_to" % ip)
        # Using MBRCONTAINS
        #q.execute("select country_code from ip_country where mbrcontains(ip_poly, pointfromwkb(point(inet_aton('%s'), 0)))" % ip)
        s.close()

    def __del__(self):
        self.connection.close()

About the Author

Jeremy Cole is MySQL Geek and co-founder at Proven Scaling, where he consults on architecture, performance, scalability, reliability, and availability. Prior to starting Proven Scaling, Jeremy was MySQL Geek at Yahoo! Inc., where he was responsible for internal consulting and support, and maintained the internal releases of MySQL. Previous to Yahoo! Inc., Jeremy worked for over four years at MySQL AB.



Introducing Kickfire

The Kickfire Database Appliance will be released in Beta version on April 14, 2008. While I have talked with Kickfire about the product and its capabilities, I have not had a chance to work extensively hands-on with the Kickfire product. That being said, I have seen a demo system running the benchmark tests, and it does perform as reported. The reader needs to understand that the benchmarks in this article were independently executed and audited. In addition, the TPC organization audited these tests.

MySQL is not the first thing to come to mind when thinking "data warehousing". When discussing hundreds and thousands of terabytes of data in storage, companies tend to think more about Oracle or other specialized servers. That might be about to change...

A small Silicon Valley company, based in Santa Clara, Calif., is gearing up to release a revolutionary new product called the Kickfire Database Appliance. Do not let the name mislead you. This product could turn the entire database world on its collective ear. It is a new paradigm for thinking about SQL and data warehousing and how to best utilize it.

Before we look into the details of what the Kickfire Database Appliance entails, it is useful to define a few terms so that everyone understands what the discussion is all about. First, what does data warehousing mean? According to Wikipedia, a "'data warehouse' is a repository of an organization's electronically stored data. Data warehouses are designed to facilitate reporting and analysis." This is fairly easy to understand. Data warehouses are a fairly new paradigm in the database world, with the concept being introduced in the mid '80s by Barry Devlin and Paul Murphy, researchers at IBM. It wasn't until the mid-90s that data warehousing came into significant use.

When benchmarking data warehousing tools it is important to use standard tests. In 1988 a non-profit organization was formed called the Transaction Processing Performance Council (TPC). They have developed standard tests that can be used to compare performance of database servers on a variety of hardware configurations. One of the tests developed by TPC is called TPC-H. The test is defined in Wikipedia: "The TPC Benchmark™H (TPC-H) is a decision support benchmark. It consists of a suite of business oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions." Wikipedia continues by saying "The performance metric reported by TPC-H is called the TPC-H Composite Query-per-Hour Performance Metric (QphH@Size), and reflects multiple aspects of the capability of the system to process queries. These aspects include the selected database size against which the queries are executed, the query processing power when queries are submitted by a single stream, and the query throughput when queries are submitted by multiple concurrent users. The TPC-H Price/Performance metric is expressed as $/QphH@Size."

Now that some background has been established, it is time to answer the ever pressing question - what's the big deal? The big deal is a chip. To provide some context, remember graphics accelerator chips when they first came out? They transferred some of the graphics manipulation from the operating system software level onto a chip. Over time this has resulted in dramatic improvements in performance. Kickfire has done the equivalent of this for MySQL. An expansion card that includes a SQL accelerator chip plugs into a server. With the addition of the

Page 16

Spring 2008 Issue 4

MySQL Magazine Typically an

Kickfire continued from page 15 Linux operating system, MySQL and additional software that acts as glue – the end result is a system that performs significantly faster. The additional chip is called a SQL Chip. The SQL Chip is a device that will at the same time both execute queries in parallel and accelerate their execution time. As this is done directly in the SQL Chip, it offloads the base or host server. There are three basic components to a Kickfire Database Appliance. The first part is what we are all familiar with – the MySQL server. Next, there is a software component that Kickfire provides – a new storage engine for MySQL, including optimizer components. Finally, the SQL Chip provides acceleration of the SQL execution. MySQL provides: ● connectivity ● security ● administration
Kickfire software provides:
● ● ●

accelerate reads from the database.

Online Transaction Processing (OLTP) website will contain a significant mix of reads and writes. warehousing where the dataset does not change from the database(es). That is why this product is targeted at data that much and the application is primarily reading The first generation will accelerate queries on

up to three terabytes of data on a single node. Over time this is expected to increase to ten's of terabytes. Kickfire currently has audited results of their

TPC-H benchmarks on a 100 gigabyte dataset.

Three systems were compared: a Sybase IQ/Sun

system ($45,467), a Microsoft SQL Server/HP system ($19,437) and a MySQL/Kickfire system ($34,300). The performance of these system was compared in two ways, Query Geometric Mean (the geometric mean of the 22 queries in the TCP hour). benchmark) and QphH (Queries executed per With Query G.M. the comparison is as follows: Sybase IQ/Sun Microsoft/HP MySQL/Kickfire 43.60 64.40 5.32


column store and cache transactional engine

SQL Chip provides:
● ● ●

SQL execution

memory management loader acceleration

In this case smaller (execution times) is going to be better. Clearly Kickfire is the winner. With the QphH comparison: Sybase IQ/Sun Microsoft/HP MySQL/Kickfire 8,587 49,267 4,521

Utilizing the Kickfire Database Appliance a company can perform many tasks that would normally be performed by a high-end Oracle server in a data warehousing setup. to release some higher-end

What does the future hold? Kickfire has plans units that are

equipped to handle larger datasets. Also, Kickfire is looking at integrating multiple SQL Chips in a higher-throughput solutions. system which should ultimately help build even Is this going to replace the MySQL server

With the QphH metric a larger number is better. Once again, Kickfire is the winner being over 5.5 times faster than the Sybase system and over 10.8 times faster than the MSFT system. one data warehousing system in terms Kickfire is now certified by the TPC as the number price/performance and the number one system in of

powering a busy Internet website? No. The reason

why is that the function of the SQL Chip is to

Kickfire continues on page 17
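The speedup and price/performance claims above can be checked directly from the published numbers. The short Python sketch below is my own illustration (none of this code comes from Kickfire or the TPC); it simply recomputes the QphH ratios and the $/QphH price/performance metric from the article's tables:

```python
# Sanity-check the published TPC-H results: QphH speedups and the
# TPC price/performance metric ($/QphH). Prices and QphH values are
# taken from the article's tables; the rest is plain arithmetic.
results = {
    "Sybase IQ/Sun":  {"price": 45_467, "qphh": 8_587},
    "Microsoft/HP":   {"price": 19_437, "qphh": 4_521},
    "MySQL/Kickfire": {"price": 34_300, "qphh": 49_267},
}

kickfire_qphh = results["MySQL/Kickfire"]["qphh"]
for name, r in results.items():
    speedup = kickfire_qphh / r["qphh"]          # Kickfire throughput vs this system
    dollars_per_qphh = r["price"] / r["qphh"]    # TPC price/performance metric
    print(f"{name:15s} QphH={r['qphh']:6d} "
          f"Kickfire speedup x{speedup:5.2f} "
          f"price/perf ${dollars_per_qphh:.2f}/QphH")
```

The computed ratios (about 5.7x versus Sybase and 10.9x versus Microsoft) match the article's "over 5.5" and "over 10.8" claims, and MySQL/Kickfire indeed comes out with the lowest $/QphH of the three systems, consistent with its number one price/performance certification.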


Kickfire is also ranked by the TPC as the number one system in terms of performance for non-clustered (single-node) systems.

Kickfire did one other independently audited test run. Percona Consulting ran the TPC-H test on a Dell server running MySQL and also on a Kickfire system (the same one used in the other TPC testing). Here are the results:

Query G.M.
MySQL/Dell ($15,000)        5,188.00
MySQL/Kickfire ($34,300)        5.32

Once again, smaller numbers are better. Clearly the MySQL/Kickfire server is far faster. In fact, some of the queries running on the MySQL/Dell server were killed because they were taking more than three hours to run.

In the words of Raj Cherabuddi, Kickfire's CEO and co-founder, "The problem today is not query performance. There are lots of vendors that solve that problem. At Kickfire, we wanted to deliver high performance without the massive hardware, power, cooling, and space costs of existing offerings. That's why what takes competitors half a room of hardware to do, we can do in a double height pizza box with the power consumption of a microwave oven."

It is a very safe statement to say that this is what sets Kickfire apart from any other data warehousing company. To solve problems in an efficient manner you ordinarily have to spend large sums of money. Kickfire wants to solve problems in an efficient manner without spending more money than many mortgages.

Some might wonder why Kickfire chose MySQL instead of another database. Raj had this to say about their choice of MySQL: "We chose MySQL not only because it is the leading open source database but, perhaps more importantly, it has emerged as a de facto standard. Being a standard, MySQL enjoys extensive third-party vendor and service provider support and is surrounded by an ecosystem of skills and knowledge, all of which Kickfire can leverage."

Kickfire is something new. They bring an approach to resolving the problems of data warehousing that I don't believe has ever been tried before. I am confident in saying that with the level of performance offered, the pricing is going to be very low when compared to other options available on the market today. I might be pulling out the crystal ball, but I see a very bright future for Kickfire.

Kickfire Homepage:
TPC Benchmark Resources:


Automatic Query Optimization with QOT
This article is about QOT - a Query analysis and Optimization Tool for MySQL. The initial idea behind QOT was to create an open source tool that would help DBAs and developers create optimal indexes based on the existing table structure and the queries that will be executed by the application. Later, when development started, it became apparent that along with index generation the tool could provide several additional capabilities based on the query analysis that it performs. These include error checking, static SQL checking, query rewrites, query reverse engineering and more. Some of these features are already available, others are still in development or are just planned. In this article I will focus on the currently available features (as of version 0.0.2). I will start with the primary feature - index generation.

Automatic Index Generation

Whenever you write a new SELECT query you always face the same question - how fast will it perform? In other words, how well will the query fit the existing data and index structure? To answer this question you need to have an idea about several facts, such as:

● what is the current data structure?
● what indexes are already defined?
● how can the MySQL optimizer use the existing indexes with the new query?
● which indexes can be added to improve query performance?

Needless to say, gathering and analyzing all this information is not a trivial task, and this is where QOT can be useful. The tool can analyze the existing data and index structure and find out which indexes will be available to the optimizer during the execution of the query. The tool can also automatically generate the missing indexes to improve query performance. We will consider several use cases. Just before we start, let me mention that the tool works fully autonomously and doesn't require a MySQL server. All the input data (queries and DDL) is provided via files and directly on the command line.

In the examples below I will use a sample schema consisting of two tables that could serve as a simple music catalog (I will assume that the model definition below is stored in the file music.sql):

CREATE SCHEMA music;
USE music;

CREATE TABLE bands (
  id INT NOT NULL PRIMARY KEY,
  band_title VARCHAR(45));

CREATE TABLE albums (
  id INT NOT NULL PRIMARY KEY,
  band_id INT NOT NULL,
  album_title VARCHAR(45),
  rating INT NOT NULL DEFAULT '3',
  INDEX ix0 (band_id));

How about a new MySQL information site? Started by Matt Reid, it is a blog aggregator and news site.

This is a simple model with a one-to-many relationship. The tables have primary keys defined. The field band_id in the table albums serves to store the relationship, thus it has an index. This is a very typical design that you would get using an ER design tool. Now I will write several queries for the music catalog and use QOT to optimize them.

Query one will fetch the ids of all albums of a given band:

SELECT id FROM music.albums WHERE band_id = 3;

Run QOT (I won't focus on the command line options, as I hope their meaning is not too hard to guess; please refer to the tool documentation for details):

$ qot \
    --input-file="music.sql" \
    --input-query="SELECT id FROM music.albums WHERE band_id = 3;" \
    --info

/* Output produced by qot 0.0.2 GPL */
/*
Query: SELECT id FROM music.albums WHERE band_id = 3
  selectivity: zero or more rows
  tables used in this query:
    albums (zero or more rows)
  existing lookup indexes for this query:
    albums.ix0(band_id)
  existing covering indexes for this query:
    (none)
*/

The default output mode is SQL, which is why all free-text information is enclosed in comments (more on this later). This query did not produce any error messages, so right under the query text you can find the information about query selectivity. This value is calculated per table, based on the indexes used and the WHERE condition. Next comes the information about the existing indexes that can be used with this query. Lookup indexes are indexes that can be used to resolve the WHERE condition. Covering indexes are indexes that can be used to completely avoid table access; in other words, a covering index can be used both to resolve the WHERE condition and to return the SELECTed item list.

The tool reports that there is an index (ix0) that can be used by the optimizer for lookup and that there are no covering indexes defined. Overall this output means that the optimizer will have only one index to consider for this query. It also means that for every row that matches the WHERE condition (i.e. for every found music album) MySQL will perform a table access, which implies at least one additional random I/O for every found row. This may (depending on the table state) force the optimizer to resort to a full table scan, making the existing index practically unusable. The optimizer would not have to perform table access if all the query fields were included in the index.

So it looks like a covering index (i.e. an index that includes all query fields) would save the situation. Let's use the tool to generate it:

$ qot \
    --input-file="music.sql" \
    --input-query="SELECT id FROM music.albums WHERE band_id = 3;" \
    --propose=cov-index

/* Output produced by qot 0.0.2 GPL */
/* additional covering indexes that can be created to improve query performance */
CREATE UNIQUE INDEX ix1 ON `music`.`albums` (band_id, id);

QOT has generated an index which can be used as a covering index for this query. Notice that the output is valid SQL, so you can redirect it to the input of the mysql monitor or execute it in Query Browser. You may wonder why the generated index is UNIQUE. That is because it includes the primary key id column, which is unique. If you apply the new index to the server and run an EXPLAIN SELECT command for the above query, you will see that the optimizer really likes the new index:

mysql> EXPLAIN SELECT id FROM music.albums WHERE band_id = 3 \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: albums
         type: ref
possible_keys: ix1,ix0
          key: ix1
      key_len: 4
          ref: const
         rows: 2
        Extra: Using index
1 row in set (0.00 sec)

Notice the "Using index" value in the Extra column. It indicates precisely what we wanted to achieve - MySQL will not touch the table. Should we always generate the missing covering keys? Well, it depends... The major factors are the query selectivity ratio and the time saved in the query with the new covering index versus the additional time lost in UPDATE, INSERT and DELETE queries for index maintenance. Future versions of QOT will give hints to help make the right decision here.

Now that we have an idea about generating covering indexes, it is time to try something more complex. In the next example there will be queries that do not have lookup indexes. We will see how to fix this using QOT.
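QOT and EXPLAIN are MySQL-specific, but the covering-index effect itself is engine-independent and easy to see for yourself. The sketch below is my own illustration using Python's built-in sqlite3 module as a stand-in: SQLite's EXPLAIN QUERY PLAN flags index-only access with the words COVERING INDEX, much like MySQL's "Using index".

```python
import sqlite3

# Rebuild a cut-down version of the article's albums table in SQLite.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE albums (
        id INTEGER PRIMARY KEY,
        band_id INTEGER NOT NULL,
        album_title TEXT,
        rating INTEGER NOT NULL DEFAULT 3);
    CREATE INDEX ix_rating ON albums (rating);
""")

def plan(sql):
    # EXPLAIN QUERY PLAN rows carry a human-readable access-path description
    # in their last column; join them into one string for easy inspection.
    return " ".join(row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT album_title FROM albums WHERE rating = 5"

# With an index on (rating) alone, the engine must still visit the table
# to fetch album_title for every matching row.
before = plan(query)

# Once album_title is part of the index, the plan becomes index-only.
con.execute("CREATE INDEX ix_cov ON albums (rating, album_title)")
after = plan(query)

print("before:", before)  # e.g. "SEARCH albums USING INDEX ix_rating (rating=?)"
print("after: ", after)   # e.g. "SEARCH albums USING COVERING INDEX ix_cov (rating=?)"
```

The exact plan wording varies between SQLite versions, but the "after" plan should report a COVERING INDEX while the "before" plan does not, mirroring the table-access savings described above.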

Want to learn about MySQL? For free? MySQL University is in session! Sessions are scheduled every Thursday at 15:00 CET: wiki/MySQL_University

SELECT album_title FROM albums WHERE rating = 5;

SELECT album_title FROM albums WHERE band_id = 1 AND rating = 5;

SELECT band_title FROM albums INNER JOIN bands ON ( = band_id) WHERE band_id = 1 AND rating = 5;

Run QOT:

$ qot \
    --input-file="music.sql" \
    --input-query="select album_title from albums where rating = 5;\
        select album_title from albums where band_id = 1 and rating = 5;\
        select band_title from albums, bands where band_id = 1 \
        and rating = 5 and = band_id;" \
    --info \
    --propose=merged-index

/* Output produced by qot 0.0.2 GPL */
/*
Query: select album_title from albums where rating = 5
  selectivity: zero or more rows
  tables used in this query:
    albums (zero or more rows)
  existing lookup indexes for this query:
    (none)
  existing covering indexes for this query:
    (none)
*/
/*
Query: select album_title from albums where band_id = 1 and rating = 5
  selectivity: zero or more rows
  tables used in this query:
    albums (zero or more rows)
  existing lookup indexes for this query:
    albums.ix0(band_id)
  existing covering indexes for this query:
    (none)
*/

/*
Query: select band_title from albums, bands where band_id = 1 and rating = 5 and = band_id
  selectivity: zero or more rows
  tables used in this query:
    bands (zero or more rows), albums (zero or more rows)
  existing lookup indexes for this query:
    bands.PRIMARY(id)
    albums.ix0(band_id)
  existing covering indexes for this query:
    (none)
*/

/* additional merged lookup indexes that can be created to improve performance of *all* the above queries */
CREATE INDEX index0 ON `music`.`albums` (rating, band_id);

The report indicates that the first query cannot use any indexes at all. This means that the optimizer will have to resort to a full table scan. The second query has some relevant indexes, but they are not optimal as they do not cover all the fields from the WHERE condition. The same is true for the last query. Finally, QOT proposes to create a single index on the albums table that will fix the problems with all three queries.

Let me explain why this index would be so good. The main reason is that it fits the pattern of the WHERE conditions of all these queries. In this form the index can be used with both expressions like (rating = 5) and (band_id = 1 AND rating = 5). Neither the reversed index (band_id, rating) nor separate indexes on the band_id and rating columns would be as good here.

With this example I will conclude the description of the index generation features. I will not cover the topic of per-query lookup index generation, because it is very similar to the last example, with the exception that the indexes are generated separately for every query. This makes sense if you want per-query optimization.
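The leftmost-prefix argument above can also be demonstrated outside MySQL. Again using Python's sqlite3 purely as a stand-in (this is my own illustration, not QOT output), the proposed (rating, band_id) index serves both WHERE patterns, while a query constraining only band_id skips the leftmost column and cannot use it:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE albums (
    id INTEGER PRIMARY KEY,
    band_id INTEGER NOT NULL,
    album_title TEXT,
    rating INTEGER NOT NULL DEFAULT 3)""")
# The merged index QOT proposed: rating first, band_id second.
con.execute("CREATE INDEX index0 ON albums (rating, band_id)")

def plan(sql):
    # Collect the access-path descriptions from EXPLAIN QUERY PLAN.
    return " ".join(row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

# Both WHERE patterns from the article constrain a leftmost prefix of
# (rating, band_id), so both can use index0.
p1 = plan("SELECT album_title FROM albums WHERE rating = 5")
p2 = plan("SELECT album_title FROM albums WHERE band_id = 1 AND rating = 5")
print(p1)  # search using index0 via the (rating) prefix
print(p2)  # search using index0 via both columns

# Constraining only band_id skips the leftmost column, so the index is
# unusable for the lookup and the planner falls back to a table scan.
p3 = plan("SELECT album_title FROM albums WHERE band_id = 1")
print(p3)
```

This is the same reason the reversed (band_id, rating) index would fail the rating-only query: a composite B-tree index helps only when the leading column is constrained.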

Other Features
Besides index generation, this release also includes query rewrites and static SQL checking.



Query rewriting means modifying the SQL source of a query. There is only one rewrite available at the moment - the expansion of '*' wildcards into field lists. Keeping wildcards in your queries might sometimes be unwanted. For example, if you later add a column to the table, it will automatically be fetched by all the queries that use the '*' wildcard, which is often not desired. QOT can help you convert those into fixed field lists. You can do that manually by running the tool and copy/pasting the output, or, if your code editor supports some kind of scripting with the ability to run shell commands, you can write a simple script for this task. In order to facilitate such scripting, QOT supports an XML output mode:

$ qot \
    --input-file="music.sql" \
    --input-query="select * from albums" \
    --rewrite=expand-stars \
    --output-mode=xml

<?xml version="1.0" encoding="UTF-8"?>
<qot version="0.0.2 GPL">
  <query>
    <rewritten-query>
      <sql><![CDATA[select `music`.`albums`.`id`, `music`.`albums`.`band_id`,
`music`.`albums`.`album_title`, `music`.`albums`.`rating` from albums]]></sql>
    </rewritten-query>
  </query>
</qot>
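If you just want the gist of the expand-stars rewrite, it can be sketched in a few lines of Python. The hard-coded schema dictionary and the regex below are my own simplifications for illustration - a real implementation, like QOT's, parses the DDL and the query properly rather than pattern-matching:

```python
import re

# Column lists as the tool would learn them from music.sql (hand-copied here).
SCHEMA = {
    "albums": ["id", "band_id", "album_title", "rating"],
    "bands": ["id", "band_title"],
}

def expand_stars(sql: str) -> str:
    """Replace 'SELECT * FROM <table>' with an explicit field list."""
    m = re.match(r"(?is)\s*select\s+\*\s+from\s+(\w+)\b(.*)", sql)
    if not m:
        return sql  # not a simple star query; leave it untouched
    table, rest = m.group(1), m.group(2)
    cols = ", ".join(f"`{table}`.`{c}`" for c in SCHEMA[table.lower()])
    return f"select {cols} from {table}{rest}"

print(expand_stars("select * from albums"))
# -> select `albums`.`id`, `albums`.`band_id`, `albums`.`album_title`,
#    `albums`.`rating` from albums
```

The point of fixing the field list is the same as in the XML example above: queries stop silently picking up columns added to the table later.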



Static checking is the process of testing source code for constructs that are formally correct but might cause problems during code execution. At the moment QOT is able to analyze SQL for type-conversion related problems. Consider the following query:

SELECT * FROM albums WHERE id='0'

It might look quite innocent, but run the tool:

$ qot \
    --input-file="music.sql" \
    --input-query="select * from albums where id='0'" \
    --static-checks

/* Output produced by qot 0.0.2 GPL */
/*
Static query checking results:
WARNING at 'id='0'': string to numeric conversion, all alphabetic values will evaluate to 0
*/

The warning says that the literal will be parsed as a string - not a number - and the server will convert the string value '0' into an integer during query evaluation. This is, of course, a performance issue, but that is not the only problem. Values like '0' or '123' will be converted as expected, but the conversion result might be surprising for values like '1abc' or 'a': the first will be converted to 1 and the second to 0.
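The conversion rule the warning refers to - keep the leading numeric prefix and treat everything else as 0 - can be mimicked in a few lines. The helper below is my own illustration of the integer case described above, not QOT's or MySQL's actual code (MySQL's full rules also cover decimals and exponents):

```python
import re

def mysql_string_to_int(s: str) -> int:
    """Mimic MySQL's cast of a string to an integer: take the leading
    optional sign and digits; anything without such a prefix becomes 0."""
    m = re.match(r"\s*[-+]?\d+", s)
    return int(m.group()) if m else 0

for value in ["0", "123", "1abc", "a"]:
    print(f"'{value}' -> {mysql_string_to_int(value)}")
# '0' -> 0, '123' -> 123, '1abc' -> 1, 'a' -> 0
```

This is exactly why comparisons like id='1abc' can silently match the row with id 1, and why id='a' matches id 0 - behavior that is rarely what the query author intended.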


Using QOT in Physical Database Design
Usually physical database design involves defining indexes on tables, and indexes are usually defined based on the relations between entities. But there is something a bit surprising about this. If you scroll back to the above examples, you will see that in four fairly common queries the original index created on the band_id field turned out to be suboptimal. Moreover, the newly generated indexes totally superseded it. The only thing it does now is take additional maintenance time.

Why did this happen? In general, the reason for such problems is that the index design was based on the model relations and not on the queries that were to be executed against the model. Shouldn't all indexes be designed based on the queries they are created for? Yes, this seems reasonable, but there is a problem - complexity. Even for a mid-size database application there could be hundreds of queries with various filtering conditions, and figuring out the correct indexes manually, based on that set of queries, is hard, error-prone work. That is why model designers resort to indirect, approximate methods: for example, if we know that customers and orders have a relation, then it is likely that there will be a query that fetches records based on this relation, and so on. This approach works, but it is obviously suboptimal, as there is no guaranteed correspondence between indexes and queries.

A better way to design indexes is to avoid defining them based on entity relations and to define manually only primary keys and other semantic constraints. After the table structure with primary keys is defined and the set of queries that the application will use to fetch data is known, all this information can be passed to QOT to generate optimal indexes. Besides the direct gain of getting an optimal index structure and having all your queries checked for potential problems, you make further development of your database much easier. Every time the database structure changes, you just remove the old indexes and rerun the tool.

That is it for now. I hope you find the tool and this article interesting and valuable. To get more information about the tool, visit the project's home page.

About the Author

Vladimir Kolesnikov is a software engineer at MySQL AB. He is involved in the development of the MySQL Workbench.

In case you don't get your MySQL news from anywhere else - I have a shocker. Sun bought MySQL AB for one billion USD in stock and cash earlier this year. The deal was finalized just a few weeks ago. So there, now you know. And I thought I would take the time to write a couple of paragraphs about how this will affect the MySQL community down the road.

There has been much virtual ink (blog entries) written about this deal. Some people feel like this could be the best thing that could ever happen to MySQL. Some people probably think this is the death knell of MySQL. Me? I certainly don't think this is the "end" for MySQL. I am somewhere in the middle. Sure, things will change, but change is part of life.

Examining the worst case scenario first, here is what I see. Let's say Sun pretty much kills MySQL, runs off the core developers and basically buries it. That is very unlikely considering the amount of money invested, but let's assume this is what they do.

log-bin continued from page 24
possibly as complex as anything that runs on That means that the project can “fork”, taking the While the code of MySQL is very complex,

MySQL provides the same or better performance on the same hardware. And with a beefed up support program through Sun large companies will feel safe put their data in the hands of a MySQL database. This will only create more opportunities for MySQL DBAs. Somewhere in between these two views is Sun is a corporation. MySQL isn't small

Linux (including the kernel), it is still open source. current codebase of MySQL and going off and

playing with it on our own..maybe even under an new name. Even if the server changed names the project could conceivably continue. Even with this worst case it is entirely possible (and very probably) that MySQL continues. What is the best case? Do you realize that Sun

probably where the future is found. gargantuan

anymore. Meshing these two companies together will create some problems. There will be MySQL employees who leave. Adjustments will be

has some of the best hardware and software worked extensively with Sun servers running

necessary. Some operational aspects will probably not be as smooth as we would like. However, in growth. This cross-pollination I wrote about will that comes out of this marriage will be stronger 500 company in the United States. the end, there will be new opportunities for occur even if only on a limited basis. The product for it. Sun probably has servers in every Fortune

engineers in the world? Back in 2000 and 2001 I Solaris. These servers provided incredible amount up to 64 CPUs and 64 gigabytes of RAM. improve.

of processing power at the time. They supported hardware and software has only continued to Their

these engineers to help improve the MySQL codebase? What if engineers who have worked on parallel-processing MySQL server? problems contributing And it would be only natural to to

What if Sun “cross-pollinates” some of

Now that

MySQL is a Sun product there will be new opportunities for MySQL in these companies. MySQL can continue to embrace future

technology and extend its reach ever deeper into

improve the ability of MySQL to run on Sun hardware with Solaris (the number three platform for MySQL server). In addition, over the last several years, Sun has made a commitment to Linux. Did you know that Sun sells a server that supports up to 16 cores of processing power and a whopping 256 GB of RAM and runs either Linux or Solaris? Meet the SunFire X4600.

That is where I am placing my bet. Feel free to write me with your thoughts at
data centers everywhere. Thanks, Keith

I know that it is very important that MySQL There are MySQL

scales out. It does this very well. But there is also opportunities for MySQL to scale up. boxes.

companies who run Oracle on some very high end Sun engineers working with

engineers can help make it possible to replace Oracle on those servers and save companies cut cost. millions of dollars. What a way for a company to Sure it cost money to migrate large datasets and applications from Oracle to MySQL. But with the astronomical licensing fees of Oracle it makes perfect sense for companies to do this if