The Problem

You are ingesting lots of data. Performance is bottlenecked in the INSERT area.

Overview of Solution

This will be couched in terms of Data Warehousing, with a huge Fact table and Summary (aggregation) tables.

Injection Speed
Some variations:
⚈ Big dump of data once an hour, versus continual stream of records.
⚈ The input stream could be single-threaded or multi-threaded.
⚈ You might have 3rd party software tying your hands.
Generally the fastest injection rate can be achieved by "staging" the INSERTs in some way, then batch processing the staged records.
This blog discusses various techniques for staging and batch processing.
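To make the later examples concrete, here is a minimal sketch of such a staging table; the host_name and host_id columns match the Normalization discussion below, while ts and bytes are hypothetical payload columns, not from the original:

CREATE TABLE Staging (
    host_name VARCHAR(99) NOT NULL,     -- as received in the input
    host_id MEDIUMINT UNSIGNED NULL,    -- NULL until batch processing fills it in
    ts TIMESTAMP NOT NULL,              -- hypothetical payload column
    bytes INT UNSIGNED NOT NULL         -- hypothetical payload column
) ENGINE=InnoDB;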
Normalization
Let's say your Input has a VARCHAR host_name column, but you need to turn that into a smaller MEDIUMINT host_id in the Fact table. The "Normalization" table, as I call it, looks something like:

CREATE TABLE Hosts (
    host_id MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,
    host_name VARCHAR(99) NOT NULL,
    PRIMARY KEY (host_id),          -- for mapping id -> name
    INDEX(host_name, host_id)       -- for mapping name -> id
) ENGINE=InnoDB;

Here's how you can use Staging as an efficient way to achieve the swap from name to id.

SQL #1:
# This should not be in the main transaction, and it should be done with autocommit = ON
# In fact, it could lead to strange errors if this were part
# of the main transaction and it ROLLBACKed.
INSERT IGNORE INTO Hosts (host_name)
SELECT DISTINCT s.host_name
FROM Staging AS s
LEFT JOIN Hosts AS n ON n.host_name = s.host_name
WHERE n.host_id IS NULL;
By isolating this as its own transaction, we get it finished in a hurry, thereby minimizing blocking. By saying IGNORE, we don't care if
other threads are 'simultaneously' inserting the same host_names.
There is a subtle reason for the LEFT JOIN. If, instead, this were a plain INSERT IGNORE .. SELECT DISTINCT (without the LEFT JOIN), the INSERT would preallocate auto_increment ids for as many rows as the SELECT provides. That is very likely to "burn" a lot of ids, thereby overflowing MEDIUMINT unnecessarily. The LEFT JOIN leads to allocating only the new ids that are needed (except for the rare possibility of a 'simultaneous' insert by another thread). More rationale: Mapping table (in Index Cookbook)
SQL #2:
This gets the IDs, whether already existing, set by another thread, or set by SQL #1.
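The statement might look like this (a sketch, assuming Staging has a host_id column that starts out NULL):

UPDATE Staging AS s
JOIN Hosts AS n  ON n.host_name = s.host_name
SET s.host_id = n.host_id;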
If the size of Staging changes depending on the busy versus idle times of the day, this pair of SQL statements has another comforting
feature. The more rows in Staging, the more efficient the SQL runs, thereby helping compensate for the "busy" times.
Flip-Flop Staging
The simple way to stage is to ingest for a while, then batch-process what is in Staging. But that leads to new records piling up waiting
to be staged. To avoid that issue, have 2 processes:
⚈ one process (or set of processes) for INSERTing into Staging;
⚈ one process (or set of processes) to do the batch processing (normalization, summarization).
To keep the processes from stepping on each other, we have a pair of staging tables:
⚈ Staging is being INSERTed into;
⚈ StageProcess is the one being processed for normalization, summarization, and moving to the Fact table.
A separate process does the processing, then swaps the tables:
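A sketch of one round of that process (StageTmp is a hypothetical scratch name used only during the swap):

# 1. Normalize, summarize, and copy StageProcess into the Fact table.
# 2. Then empty it and atomically swap it back in as the new Staging:
DROP TABLE StageProcess;
CREATE TABLE StageProcess LIKE Staging;
RENAME TABLE Staging      TO StageTmp,
             StageProcess TO Staging,
             StageTmp     TO StageProcess;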
This may not seem like the shortest way to do it, but it has these features:
⚈ The DROP + CREATE might be faster than TRUNCATE, and speed is the desired effect here.
⚈ The RENAME is atomic, so the INSERT process(es) never find that Staging is missing.
A variant on the 2-table Flip-Flop is to have a separate Staging table for each Insertion process. The Processing process would run
around to each Staging in turn.
A variant on that would be to have a separate Processing process for each Insertion process.
The choice depends on which is faster (Insertion or Processing). There are tradeoffs; a single Processing thread avoids some locks, but
lacks some parallelism.
ENGINE Choice
Fact table -- InnoDB, if for no other reason than that a system crash would not need a REPAIR TABLE. (REPAIRing a billion-row
MyISAM table can take hours or days.)
Normalization tables -- InnoDB, primarily because the mapping can be done efficiently with 2 indexes, whereas MyISAM would need 4 to achieve the same efficiency.
Confused? Lost? There are enough variations among applications that it is impractical to predict what is best. Or, for that matter, what is simply good enough. Your ingestion rate may be low enough that you never hit the brick walls I am helping you avoid.
Should you do "CREATE TEMPORARY TABLE"? Probably not. Consider Staging as part of the data flow, not to be DROPped.
Summarization
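The details depend on your Summary tables. Here is a minimal sketch of rolling the staged rows up into an hourly summary; it reads from StageProcess per the flip-flop scheme, and it assumes Summary has a UNIQUE key on (hr, host_id) and reuses the hypothetical ts and bytes columns from the Staging sketch above:

INSERT INTO Summary (hr, host_id, ct, total_bytes)
    SELECT  DATE_FORMAT(ts, '%Y-%m-%d %H:00:00') AS hr,  -- truncate to the hour
            host_id,
            COUNT(*),
            SUM(bytes)
        FROM StageProcess
        GROUP BY hr, host_id
    ON DUPLICATE KEY UPDATE            -- fold into any existing hourly row
        ct          = ct + VALUES(ct),
        total_bytes = total_bytes + VALUES(total_bytes);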
Replication Issues
The following allows you to keep more of the Ingestion process in the Master, thereby not bogging down the Slave(s) with writes to the
Staging table.
⚈ RBR
⚈ Staging is in a separate database
⚈ That database is not replicated (binlog-ignore-db on Master)
⚈ In the Processing steps, USE that database, and reach into the main db via syntax like "MainDb.Hosts". (Otherwise, binlog-ignore-db does the wrong thing.) See the sketch after this list.
That way:
⚈ Writes to Staging are not replicated.
⚈ Normalization sends only the few updates to the normalization tables.
⚈ Summarization sends only the updates to the summary tables.
⚈ Flip-flop does not replicate the DROP, CREATE or RENAME.
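A sketch of that setup; the database names Stage and MainDb are placeholders, not from the original:

# my.cnf on the Master
[mysqld]
binlog_format    = ROW
binlog-ignore-db = Stage

-- In the Processing steps:
USE Stage;                            -- statements against Stage.* are not binlogged
INSERT IGNORE INTO MainDb.Hosts (host_name)
    SELECT DISTINCT s.host_name
        FROM Staging AS s             -- resolves to Stage.Staging
        LEFT JOIN MainDb.Hosts AS n ON n.host_name = s.host_name
        WHERE n.host_id IS NULL;      -- this write to MainDb still replicates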
Sharding
You could possibly spread the data you are trying to ingest across multiple machines in a predictable way (sharding on hash, range, etc).
Running "reports" on a sharded Fact table is a challenge unto itself. On the other hand, Summary Tables rarely get too big to manage on
a single machine.
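For example, a hash-based shard choice can be as simple as this expression (4 shards, CRC32, and the sample host name are arbitrary choices for illustration):

SELECT CRC32('dns1.example.com') % 4 AS shard_no;   -- which machine gets this row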
Push Me vs Pull Me
I have implicitly assumed the data is being pushed into the database. If, instead, you are "pulling" data from some source(s), then there
are some different considerations.
If you need parallelism in Summarization, you will have to sacrifice the transactional integrity of steps 4-7.
Caution: If these steps add up to more than an hour, you are in deep doo-doo.
It is probably reasonable to have multiple processes doing this, so the procedure must be careful about locking.
innodb_log_file_size should be larger than the change in the STATUS value Innodb_os_log_written across the BEGIN...COMMIT transaction (for either Case).
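One way to check that (a sketch):

SHOW GLOBAL STATUS LIKE 'Innodb_os_log_written';   -- note the value
-- run the BEGIN ... COMMIT batch transaction here
SHOW GLOBAL STATUS LIKE 'Innodb_os_log_written';   -- the increase should stay
                                                   -- well under innodb_log_file_size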
References
Postlog
posted Nov, 2014; refreshed Aug, 2015
-- Rick James
Contact me via LinkedIn; be sure to include a brief teaser in the Invite request:
Did my articles help you out? Like what you see? Consider donating:
☕ Buy me a Banana Latte ($4) There is no obligation but it would put a utf8mb4 smiley 🙂 on my face, instead of the Mojibake "ðŸ™‚"