With the given workload, compression and late materialization improve performance by factors of two and three, respectively [12]. We believe these results are largely orthogonal to ours, since we heavily compress both the HP and VP and our workload does not lend itself to late materialization of tuples. “Comparison of Row Stores and Column Stores in a Common Framework” mainly focused on super-tuples and column abstraction. The slotted page format in HP results in a lower compression ratio than VP [10]. Super-tuples may improve the compression ratio by storing rows under one header with no slot array. Column abstraction avoids storing repeated attributes multiple times by adding information to the header. The comparison is made over a varying number of columns with uniformly distributed data for VP and HP, while retrieving all columns from the table.

The VP concept has been implemented in the Decomposition Storage Model (DSM), with a storage design of (tuple id, attribute value) for each column (MonetDB) [9]. The C-Store data model contains overlapping projections of tables. L2 cache behaviour may be improved by the PAX architecture, which focuses on storing tuples column-wise on each slot [7], with a penalty in I/O bandwidth. Data Morphing improves on PAX to give even better cache performance by dynamically adapting attribute groupings on the page [11].
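The (tuple id, attribute value) layout used by DSM can be illustrated with a minimal sketch. The function names `decompose` and `reconstruct` and the in-memory list representation are assumptions made for illustration only; they do not reflect how MonetDB stores data internally.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]
Column = List[Tuple[int, object]]


def decompose(rows: List[Row]) -> Dict[str, Column]:
    """Split row-oriented records into one (tuple_id, value) list per attribute."""
    columns: Dict[str, Column] = {}
    for tuple_id, row in enumerate(rows):
        for name, value in row.items():
            columns.setdefault(name, []).append((tuple_id, value))
    return columns


def reconstruct(columns: Dict[str, Column], tuple_id: int) -> Row:
    """Rebuild a single row by looking up the tuple id in every column."""
    return {name: dict(pairs)[tuple_id] for name, pairs in columns.items()}


rows = [
    {"prod_id": 1, "cust_id": 10},
    {"prod_id": 2, "cust_id": 10},
]
columns = decompose(rows)
```

A query that touches only `prod_id` reads a single column list, whereas rebuilding a full row requires one lookup per attribute, which is the trade-off between VP and HP that the comparison above examines.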
B. Database Compression Techniques
Compression techniques in databases are mostly based on slotted-page HP. The compression ratio may be improved up to 8-12 by using processing-intensive techniques [13]. The VP compression ratio is examined in “Superscalar RAM-CPU Cache Compression” and “Integrating Compression and Execution in Column-Oriented Database Systems” [21, 3]. Zukowski presented a compression algorithm that optimizes the use of modern processors while requiring less I/O bandwidth. The effect of run lengths on the degree of compression was studied, and dictionary encoding proved to be the best compression scheme for VP [3].
III. PERFORMANCE MEASURING FACTORS
Our contribution to the existing approach is based on the major factors affecting the performance of HP and VP: (a) data distribution, (b) cardinality, (c) number of columns, (d) compression technique, and (e) query nature.
A. Data Characteristics
The search time and performance of two relational tables vary with the number of attributes, the data type of each attribute, the compression ratio, column cardinality, and selectivity.
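Two of the characteristics named above, column cardinality and selectivity, can be computed as in the following minimal sketch. It assumes the common definitions (cardinality as the number of distinct values, selectivity as the fraction of rows satisfying a predicate); the text does not spell these out, and the helper names are hypothetical.

```python
from typing import Callable, Sequence


def cardinality(column: Sequence[object]) -> int:
    """Number of distinct values in the column (assumed definition)."""
    return len(set(column))


def selectivity(column: Sequence[object], predicate: Callable[[object], bool]) -> float:
    """Fraction of rows for which the predicate holds (assumed definition)."""
    matching = sum(1 for value in column if predicate(value))
    return matching / len(column)


# Hypothetical channel_id column with ten rows and four distinct values.
channel_id = [1, 1, 2, 3, 3, 3, 4, 4, 4, 4]
distinct_channels = cardinality(channel_id)
fraction_channel_4 = selectivity(channel_id, lambda v: v == 4)
```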
B. Compression Techniques
Dictionary based coding
Repeated occurrences of a pattern are replaced by a codeword that points to the index of the dictionary entry containing the pattern. Both codewords and uncompressed instructions are part of the compressed program. A performance penalty occurs when (a) the dictionary cache line is bigger than the processor's L1 data cache, (b) the index size is larger than the value, or (c) the un-encoded column size is smaller than the size of the encoded column plus the size of the dictionary [3].
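A minimal sketch of dictionary coding on a single column follows, together with a check for condition (c) above. The function names and the fixed per-value and per-code byte sizes are assumptions made for illustration; a real system would derive them from the column's data type and dictionary size.

```python
from typing import Dict, List, Sequence, Tuple


def dictionary_encode(values: Sequence[str]) -> Tuple[List[str], List[int]]:
    """Replace each value by the index of its entry in a dictionary."""
    dictionary: List[str] = []
    index_of: Dict[str, int] = {}
    codes: List[int] = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


def dictionary_decode(dictionary: Sequence[str], codes: Sequence[int]) -> List[str]:
    """Recover the original column from the dictionary and the codes."""
    return [dictionary[code] for code in codes]


def worth_encoding(values: Sequence[str], value_bytes: int, code_bytes: int) -> bool:
    """Condition (c): encoding pays off only if the encoded column plus the
    dictionary is smaller than the un-encoded column.
    value_bytes and code_bytes are hypothetical fixed widths."""
    dictionary, codes = dictionary_encode(values)
    unencoded_size = len(values) * value_bytes
    encoded_size = len(codes) * code_bytes + len(dictionary) * value_bytes
    return encoded_size < unencoded_size


channel_class = ["web", "store", "web", "web", "store"]
dictionary, codes = dictionary_encode(channel_class)
```

With few distinct values the codes are much narrower than the values and the dictionary stays small; when every value is distinct, the dictionary alone is as large as the original column and encoding only adds overhead.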
Delta coding
The data is stored as the difference between successive samples (or characters). The first value in the delta-encoded file is the same as the first value in the original data. Every following value in the encoded file is equal to the difference (delta) between the corresponding value in the input file and the previous value in the input file. For uniform values in the database, delta encoding is beneficial for data compression. Delta coding may be performed at both the column level and the tuple level. For unsorted sequences, and when size-of(encoded) is larger than size-of(un-encoded), delta encoding is less beneficial [3].
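The scheme described above can be sketched for an integer column as follows. The function names are hypothetical, and the sketch assumes numeric values; it illustrates column-level delta coding only.

```python
from itertools import accumulate
from typing import List, Sequence


def delta_encode(values: Sequence[int]) -> List[int]:
    """Keep the first value, then store each value minus its predecessor."""
    if not values:
        return []
    deltas = [values[0]]
    for previous, current in zip(values, values[1:]):
        deltas.append(current - previous)
    return deltas


def delta_decode(deltas: Sequence[int]) -> List[int]:
    """A running sum over the deltas restores the original values."""
    return list(accumulate(deltas))


sorted_ids = [1000, 1001, 1003, 1006]
unsorted_ids = [5, 900, 3]
sorted_deltas = delta_encode(sorted_ids)
unsorted_deltas = delta_encode(unsorted_ids)
```

For the sorted column the deltas after the first value are small and fit in fewer bits than the originals. For the unsorted column the deltas are as large as the values themselves and may be negative, which is why delta coding is less beneficial there.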
Run Length Encoding (RLE)
Sequences of the same data value within a file are replaced by a count and a single value. RLE compression works best for sorted sequences with long runs. RLE is more beneficial for VP [3].
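A minimal sketch of run-length encoding as (count, value) pairs is given below; the function names are hypothetical. It also shows why sorted data with long runs compresses better than unsorted data.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def rle_encode(values: Sequence[str]) -> List[Tuple[int, str]]:
    """Replace each run of identical adjacent values by (count, value)."""
    return [(len(list(group)), value) for value, group in groupby(values)]


def rle_decode(runs: Sequence[Tuple[int, str]]) -> List[str]:
    """Expand each (count, value) pair back into a run of values."""
    return [value for count, value in runs for _ in range(count)]


sorted_column = ["a", "a", "a", "b", "b"]
unsorted_column = ["a", "b", "a", "b", "a"]
sorted_runs = rle_encode(sorted_column)
unsorted_runs = rle_encode(unsorted_column)
```

The sorted column collapses into two runs, while the unsorted column with the same values produces five runs of length one, so the encoded form is larger than the input.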
C. Query Parameters and Table Generation
To study the effect of queries with table characteristics, queries were tested with a varying number of predicates and selectivity factor. The factors affecting the execution plan and cost are (a) schema definition, (b) selectivity factor, (c) number of columns referenced, and (d) number of predicates. The execution time of a query changes with column characteristics and I/O bandwidth. For each column characteristic, the query generator randomly selects the columns used to produce a set of “equivalent” queries with the cost analysis [12].
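The idea of randomly selecting columns to produce a set of comparable queries can be sketched as follows. This is not the generator from [12]: the function name, the `COUNT(*)` query shape, the `<=` predicates, and the `:column_bound` placeholders (which would be bound to values chosen to reach a target selectivity) are all assumptions made for illustration.

```python
import random
from typing import List, Sequence


def generate_queries(
    table: str,
    columns: Sequence[str],
    num_predicates: int,
    num_queries: int,
    seed: int = 0,
) -> List[str]:
    """Build queries that share a shape but filter on randomly chosen columns."""
    rng = random.Random(seed)  # fixed seed keeps the query set reproducible
    queries = []
    for _ in range(num_queries):
        chosen = rng.sample(list(columns), num_predicates)
        # Each predicate uses a named placeholder for its selectivity bound.
        where = " AND ".join(f"{column} <= :{column}_bound" for column in chosen)
        queries.append(f"SELECT COUNT(*) FROM {table} WHERE {where}")
    return queries


sales_columns = ["prod_id", "cust_id", "time_id", "channel_id", "promo_id"]
queries = generate_queries("sales", sales_columns, num_predicates=2, num_queries=3)
```

Varying `num_predicates` while keeping the table and the number of queries fixed gives sets of queries that differ only in the factor under study.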
Performance measurement with compression is implemented by:
(a) Generating an uncompressed HP version of each table with the primary key on the leftmost column.
(b) Sorting on the columns frequently used in queries.
(c) Generating a replica in VP.
IV. IMPLEMENTATION DETAIL
To study the effect of VP and HP, the experiments are run against the TPC-H standard star schema on MonetDB. We mainly concentrated on the fact table, Sales, which contains approximately 10L (1 million) records. We focused on five columns for selectivity, i.e. prod_id, cust_id, time_id, channel_id, and promo_id, with selectivity varying from 0.1% to 50%.

SELECT p.product_name, ch.channel_class, c.cust_city, t.calendar_quarter_desc,
       SUM(s.amount_sold) sales_amount
FROM sales s, times t, customers c, channels ch, products p, promotions pr
WHERE s.time_id = t.time_id
AND s.prod_id = p.prod_id
AND s.cust_id = c.cust_id
AND s.channel_id = ch.channel_id
(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 10, October 2011, p. 99. http://sites.google.com/site/ijcsis/ ISSN 1947-5500