You are on page 1of 8

Column Oriented Database Technologies

Posted by Dale Anderson | On July 24th, 2012 | In Big Data, Web & Software Development |
Tags: Column oriented database
Column Oriented Database Technologies (click to download)
My recent blog (Big Data & NoSQL Technologies) discussed various NoSQL technologies and
market vendors. Today lets dive into column-oriented databases and why they should play an important
role in any data warehouse whose focus is on aggregations or metrics (and whose isnt?).
So you are all probably familiar with row-oriented databases. Tables of data where rows of fields (also
called columns) represent the structural storage and the corresponding SQL queries that select, insert,
update, and delete that data. Most database vendors like Oracle, Microsoft, Sybase, Informix, and many
others all base their technology on this ANSI standard. Column-oriented databases are indeed what you
might surmise; tables of data where columns of data values represent the structural storage. What you
might not expect is that on the surface many column-oriented databases look and feel like row oriented
databases also using SQL queries in much the same way. Creating tables, storing data, querying them
are all pretty much identical. They may appear similar, but two principal things to understand is that the
significant differences under the hood, in particular, physical storage and query optimization.

As noted in my previous blogs on NoSQL, there is also a column-store technology out there. Lets not
confuse that with column oriented databases. They are different. Since several NoSQL column-store
vendors were highlighted before, we will focus instead on the column oriented database vendors here.
First, some key benefits to column oriented databases:
High performance on aggregation queries (like COUNT, SUM, AVG, MIN, MAX)
Highly efficient data compression and/or partitioning
True scalability and fast data loading for Big Data
Accessible by many 3rd party BI analytic tools
Fairly simple systems administration
Due to their aggregation capabilities which compute large numbers of similar data items, column oriented
databases offer key advantages for certain types of systems, including:
Data Warehouses and Business Intelligence
Customer Relationship Management (CRM)
Library Card Catalogs
Ad hoc query systems

Column oriented database technology has actually been around for many years originating in 1969 with
an application called TAXIR which provided abstracts for mathematical biosciences. In 1976, Canada
implemented the RAPID system for processing and retrieval of population and housing census statistics.
Sybase IQ was the only commercially available column-oriented database for many years, yet that has
changed rapidly in the last few years. Lets take a quick look at some of todays key players:
SAP Sybase IQ (www.sybase.com)

A highly optimized analytics server designed specifically to deliver superior performance for missioncritical business intelligence, analytics and data warehousing solutions on any standard hardware and
operating system. Its column oriented grid-based architecture, patented data compression, and
advanced query optimizer delivers high performance, flexibility, and economy in challenging reporting and
analytics environments.

o
o
o
o
o
o
o
o
o

Essentially a data partitioned, index based storage technology, Sybase IQs engine offers several key
features which include:
Web enabled analytics
Communications & Security
Fast Data Loading
Query Engine supporting Full Text Search
Column Indexing Sub System
Column Storage Processor
User Friendly CUI based Administration & Monitoring
Multiplex Grid Architecture
Information Live-cycle management
The Sybase IQ Very Large Data Base (VLDB) option provides partitioning and placement where a table
can have a specified column partition key with value ranges. This partition allows data that should be
grouped together to be grouped together and separates data where they should be separated. The
drawback to this methodology is that it is not always known which is which.
Infobright (www.infobright.com)

Offering both a commercial (IEE) and a free community (ICE) edition, the combination of a column
oriented database with their Knowledge Grid architecture delivers a self-managed, scalable, high
performance analytics query platform. Allowing 50Tb using a single server, their industry-leading data
compression (10:1 up to 40:1) significantly reduces storage requirements and expensive hardware

infrastructures. Delivered as a MySQL engine, Infobright runs on multiple operating systems and
processors needing only a minimum of 4Gb of RAM (however 16Gb is a recommended starting point).

Avoiding partition schemes, Infobright data is stored in data packs, each node containing pre-aggregated
statistics about the data stored within them. The Knowledge Grid above provides related metadata
providing a high level view of the entire content of the database. Indexes, projections, partitioning or
aggregated tables are not needed as these metadata statistics are managed automatically. The granular
computing engine processes queries using the Knowledge Grid information to optimize query processing
eliminating or significantly reducing the amount of data required for decompressing and access to answer
a query. Some queries may not need to access the data at all, finding instead the answer in the
Knowledge
Grid
itself.

The Infobright Data Loader is highly efficient so data inserts are very fast. This performance gain does
come at a price so avoid updates unless absolutely necessary, design de-normalized tables, and dont
plan on any deletes. New features to the data loader include a reject option which allows valid rows to

commit while invalid rows are logged. This is highly useful when loading millions of rows and only having
a few rows with bad data. Without this feature the entire data load would be rolled back.
Vertica (HP) (www.vertica.com)

Recently acquired by Hewlett Packard, this platform was purpose built from the ground up to enable data
values having high performance real-time analytics needs. With extensive data loading, queries,
columnar storage, MPP architecture, and data compression features, diverse communities can develop
and scale with a seamless integration ecosystem.

o
o
o
o
o
o
o
o
o

Claiming elasticity, scale, performance, and simplicity the Vertica analytics platform uses transformation
partitioning to specify which rows belong together and parallelism for speed. Several key features
include:
Columnar Storage & Execution
Real-Time Query & Loading
Scale-out MPP Architecture
Automatic High Availability
Aggressive Data Compression
Extensible In-Database Analytics Framework
In-Database Analytics Library
Database Designer & Administration Tools
Native BI & ETL support for MapReduce & Hadoop
The Vertica Optimizer is the brains of the analytics platform producing optimal query execution plans
where several choices exist. It does this through traditional considerations like disk I/O and further
incorporates CPU, memory, network, concurrency, parallelism factors and the unique details of the
columnar operator and runtime environment.
ParAccel (www.paraccel.com)

Analytic-driven companies need a platform, not just a database where speed, agility, and complexity drive
the data ecosystem. The ParAccel Analytic Platform streamlines the delivery of complex business
decisions through its high performance analytic database. Designed for speed, its extensible framework
supports on-demand integration and embedded functions.

The ParAccel Database (PADB) present four main components: the Leader node, the Compute node,
the Parallel Communications Fabric, and an optional Storage Area Network (SAN). The Leader controls
the execution of the Compute nodes and all nodes communicate with each other via the Fabric running
on standard x86 Linux servers. Each Compute node is subdivided into a set of parallel processes called
slices that include a CPU core, and thier allocation of memory, and local disk storage. The
Communication Fabric provides a low-level MPP network protocol for increased performance.

o
o
o
o
o
o

Key PADB features include:


High Performance & Scalability
Columnar Orientation
Extensible Analytics
Query Compilation
High Availability
Solution Simplicity

ParAccel Integrated Analytics Library and Extensibility Framework incorporates advanced functions along
with an API to add your own functions to help address complex business problems right in the core
database enabling customers to focus upon their specific data complexities.
Microsoft SQL Server 2012 (www.microsoft.com)

Released this year, Microsoft has now embraced the columnar database idea. The latest SQL Server
release 2012 includes xVelocity, a column-store index feature that stores data similar to a columnoriented DBMS. While not a true column oriented database, this technique allows for the creation of a
memory optimized index that groups and stores data for each column then and joins them together to
complete the index. For certain types of queries, like aggregations, the query processor can take
advantage of the column-store index to significantly improve execution times. Column store indexes can
be used with partitioned tables providing a new way to think about how to design and process large
datasets.

The column-store index can be very useful on large fact tables in a Star schema improving overall
performance, however the cost model approach utilized may choose the column-store index for a table
when
a
row
based
index
would
have
been
better.
Using
the
IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX query hint will work around this if it occurs.
When data is stored with a column-store index, data can often be compressed more effectively over a row
based index. This is accomplished as typically there is more redundancy within a column than within a
row. Higher compression means less IO is required to retrieve data into memory which can significantly
reduce response times.


o
o
o
o

o
o
o
o

There are several restrictions and limitation in using a column-store index. For example, which data types
are supported or not and that you can only create one column-store index on any table can be
problematic. Become familiar with what it can do and where best to use it. Currently the column-store
index is not supported on Microsoft Azure.
Column-oriented databases provide significant advantages over traditional row oriented system applied
correctly; In particular for data warehouse and business intelligence environments where aggregations
prevail. It would not be fair however to ignore the disadvantages. Lets look at these two:
Column-Oriented Advantages
Efficient storage and data compression
Fast data loads
Fast aggregation queries
Simplified administration & configuration
Column-Oriented Disadvantages
Transactions are to be avoided or just not supported
Queries with table joins can reduce high performance
Record updates and deletes reduce storage efficiency
Effective partitioning/indexing schemes can be difficult to design

The real value in using column-oriented database technology comes from


high performance, scalable storage and retrieval of large to massive datasets (Big Data) focused on
aggregation queries. Simply put: Reports! You can design Star schemas or Data Vaults (The Data

Vault What is it? Why do we need it?) incorporating these technologies and you will find that
column-oriented databases provide a clear solution in data warehouse and business intelligence.
Look for future blogs on Hadoop/Hive/HBase and Extract-Transform-Load (ETL) technologies, and dont
be afraid to comment, question, or debate, there is always room to learn new things

Read more: http://www.dbbest.com/blog/column-oriented-databasetechnologies/#ixzz3SHHtBYeS


Follow us: @dbbest_tech on Twitter | DBBestTech on Facebook