Vertical Vs Horizontal Partition: In Depth

Published by: ijcsis on Nov 25, 2011
Copyright:Attribution Non-commercial

Tejaswini Apte
Sinhgad Institute of Business Administration and Research
Kondhwa (BK), Pune-411048

Dr. Maya Ingle
Devi Ahilya VishwaVidyalay, Indore
maya_ingle@rediffmail.com

Dr. A.K. Goyal
Devi Ahilya VishwaVidyalay, Indore
goyalkcg@yahoo.com
Abstract - Traditional database systems are optimized for write-intensive operations and predictable query behavior. With growing data volumes and the unpredictable nature of queries, write-optimized systems have proven to be poorly suited. Recently, interest has been renewed in architectures that optimize read performance by using a Vertically Partitioned data representation. In this paper, we identify and analyze the components affecting the performance of Horizontal and Vertical Partition. Our study focuses on tables with different data characteristics and on complex queries. We show that a carefully designed Vertical Partition may outperform a carefully designed Horizontal Partition, sometimes by an order of magnitude.
General Terms: Algorithms, Performance, Design
Keywords: Vertical Partition, Selectivity, Compression, Horizontal Partition
I. INTRODUCTION
Storing relational tables vertically on disk has been of keen interest in the data warehouse research community. The main reason lies in minimizing the time required for disk reads in rapidly growing data warehouses. Vertical Partition (VP) offers better cache management with less storage overhead. For queries retrieving more columns, however, VP requires stitching the columns back together, which offsets the I/O benefits and can cause a longer response time than the same query on the Horizontal Partition (HP). HP stores tuples on physical blocks with a slot array that specifies the offset of each tuple on the page [15]. The HP approach is superior for queries that retrieve more columns and for transactional databases. For queries that retrieve fewer columns (DSS systems), the HP approach may result in more I/O bandwidth, poor cache behavior and a poor compression ratio [6].

Recent advances in database technology have improved the HP compression ratio by storing tuples densely in the block, at the cost of poorer updatability, and have improved its I/O bandwidth relative to VP. Skewed datasets and advanced compression techniques, which bring the degree of HP compression close to the entropy of the table, have opened a research path on query response time and HP performance for DSS systems [16].

Previous research results relevant to this paper are:
- HP is superior to VP at low selectivity when the query retrieves more columns with no chaining and the system is CPU constrained.
- The selectivity factor and the number of retrieved columns determine the processing time of VP relative to HP.
- VP may be sensitive to the amount of processing needed to decompress a column.

The compression ratio may be improved for non-uniform distributions [13]. The research community has mainly focused on a single predicate with low selectivity, applied to the first column of the table, with the same column retrieved by the query [12]. We believe that the relative performance of VP and HP is affected by (a) the number of predicates, (b) the columns to which predicates are applied and their selectivity, and (c) the resultant columns. Our approach mainly focuses on factors affecting the response time of HP and VP, i.e., (a) additional predicates, (b) data distribution and (c) join operations.

For various applications, it has been observed that VP has several advantages over HP. We discuss related, existing and recent compression techniques for HP and VP in Section 2. Many factors affect the performance of HP and VP. Section 3 provides a comparative study of performance measures with query characteristics. The implementation details of our approach and an analysis of the results are presented in Section 4. Finally, we conclude with a short discussion of our work in Section 5.
 
II. RELATED WORK
In this section, some existing compression techniques used in VP and HP are discussed briefly, along with the latest methodologies.
A. Vertical Storage
The VP and HP comparison is presented with C-Store and the Star Schema Benchmark [12]. VP is implemented using commercial relational database systems by making each column its own table. This approach pays a higher performance penalty, since every column must have its own row-id. To prove the superiority of HP over VP, the analysis was done by implementing HP in C-Store (a VP database). Compression, late materialization and block iteration were the basis for measuring the performance of VP over HP.
(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 10, October 2011, http://sites.google.com/site/ijcsis/, ISSN 1947-5500
 
With the given workload, compression and late materialization improve performance by factors of two and three, respectively [12]. We believe these results are largely orthogonal to ours, since we heavily compress both HP and VP and our workload does not lend itself to late materialization of tuples. "Comparison of Row Stores and Column Stores in a Common Framework" mainly focused on super-tuples and column abstraction. The slotted page format in HP results in a lower compression ratio than VP [10]. Super-tuples may improve the compression ratio by storing rows under one header with no slot array. Column abstraction avoids storing repeated attributes multiple times by adding information to the header. The comparison is made over a varying number of columns with uniformly distributed data for VP and HP, while retrieving all columns from the table.

The VP concept has been implemented in the Decomposition Storage Model (DSM), with a storage design of (tuple id, attribute value) for each column (MonetDB) [9]. The C-Store data model contains overlapping projections of tables. L2 cache behaviour may be improved by the PAX architecture, which focuses on storing tuples column-wise on each slot [7], with a penalty in I/O bandwidth. Data Morphing improves on PAX to give even better cache performance by dynamically adapting attribute groupings on the page [11].
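The (tuple id, attribute value) design of DSM can be pictured with a small Python sketch. It is only an illustration under simplifying assumptions: the function names decompose and stitch, and the sample rows, are hypothetical and are not taken from MonetDB or from the cited work.

```python
def decompose(rows, column_names):
    """Split row tuples into one list of (tuple_id, value) pairs per column."""
    columns = {name: [] for name in column_names}
    for tuple_id, row in enumerate(rows):
        for name, value in zip(column_names, row):
            columns[name].append((tuple_id, value))
    return columns


def stitch(columns, wanted):
    """Rebuild partial tuples for the requested columns by matching tuple ids."""
    by_id = {}
    for name in wanted:
        for tuple_id, value in columns[name]:
            by_id.setdefault(tuple_id, {})[name] = value
    return [tuple(by_id[t][name] for name in wanted) for t in sorted(by_id)]


# Hypothetical sample data: (prod_id, cust_id, channel)
rows = [(1, 10, "Internet"), (2, 11, "Catalog"), (3, 10, "Internet")]
names = ["prod_id", "cust_id", "channel"]
columns = decompose(rows, names)
partial = stitch(columns, ["prod_id", "channel"])
```

A query that needs only prod_id and channel touches two of the three column lists, but it pays for the stitching step; the more columns are requested, the more of that reconstruction work is needed.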
B. Database Compression Techniques
Compression techniques in databases are mostly based on slotted-page HP. The compression ratio may be improved up to 8-12 by using processing-intensive techniques [13]. The VP compression ratio is examined in "Superscalar RAM-CPU Cache Compression" and "Integrating Compression and Execution in Column-Oriented Database Systems" [21, 3]. Zukowski presented a compression algorithm optimized to exploit modern processors with less I/O bandwidth. The effect of run lengths on the degree of compression was studied, and dictionary encoding proved to be the best compression scheme for VP [3].
III. PERFORMANCE MEASURING FACTORS
Our contribution to the existing approach is based on the major factors affecting the performance of HP and VP: (a) data distribution, (b) cardinality, (c) number of columns, (d) compression technique and (e) query nature.
A. Data Characteristics
The search time and performance of two relational tables vary with the number of attributes, the data type of each attribute, the compression ratio, the column cardinality and the selectivity.
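Two of these characteristics, column cardinality and selectivity, are easy to state precisely. The following Python sketch assumes the usual definitions (number of distinct values, and fraction of rows that satisfy a predicate); the helper names and the sample column are hypothetical.

```python
def cardinality(column):
    """Number of distinct values in the column."""
    return len(set(column))


def selectivity(column, predicate):
    """Fraction of rows for which the predicate holds (0.0 for an empty column)."""
    if not column:
        return 0.0
    return sum(1 for value in column if predicate(value)) / len(column)


# Hypothetical channel_id column with eight rows
channel_id = [3, 4, 3, 9, 3, 4, 3, 3]
distinct_channels = cardinality(channel_id)
fraction_channel_9 = selectivity(channel_id, lambda v: v == 9)
```

Here one row out of eight qualifies, so the predicate channel_id = 9 has a selectivity of 12.5%.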
B. Compression Techniques
Dictionary based coding
Repeated occurrences of a pattern are replaced by a codeword that points to the index of the dictionary entry containing the pattern. Both codewords and uncompressed instructions are part of the compressed program. A performance penalty occurs when (a) the dictionary cache line is bigger than the processor's L1 data cache, (b) the index size is larger than the value, or (c) the un-encoded column size is smaller than the size of the encoded column plus the size of the dictionary [3].
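As a rough illustration of dictionary-based coding, the Python sketch below replaces each value of a column by a small integer code that indexes into a dictionary. The function names and the sample column are hypothetical; a real system would bit-pack the codes rather than keep them in a Python list.

```python
def dict_encode(column):
    """Replace each value by an integer code that indexes into a dictionary."""
    dictionary = []        # code -> value
    code_of = {}           # value -> code
    encoded = []
    for value in column:
        if value not in code_of:
            code_of[value] = len(dictionary)
            dictionary.append(value)
        encoded.append(code_of[value])
    return dictionary, encoded


def dict_decode(dictionary, encoded):
    """Recover the original column by looking each code up in the dictionary."""
    return [dictionary[code] for code in encoded]


# Hypothetical low-cardinality column
channels = ["Internet", "Catalog", "Internet", "Internet", "Catalog"]
dictionary, encoded = dict_encode(channels)
```

The encoding pays off only while the codes plus the dictionary are smaller than the raw column, which is condition (c) above.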
Delta coding
The data is stored as the difference between successive samples (or characters). The first value in the delta-encoded file is the same as the first value in the original data. Each following value in the encoded file is equal to the difference (delta) between the corresponding value in the input file and the previous value in the input file. For uniform values in the database, delta encoding is beneficial for data compression. Delta coding may be performed at both the column level and the tuple level. For unsorted sequences, or when size-of(encoded) is larger than size-of(un-encoded), delta encoding is less beneficial [3].
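The encoding rule just described can be written down in a few lines. This is a minimal Python sketch for integer columns; the function names and the sample values are hypothetical.

```python
def delta_encode(values):
    """Keep the first value, then store each value as the difference from its predecessor."""
    if not values:
        return []
    deltas = [values[0]]
    for previous, current in zip(values, values[1:]):
        deltas.append(current - previous)
    return deltas


def delta_decode(deltas):
    """Rebuild the original values with a running sum over the deltas."""
    values = []
    total = 0
    for delta in deltas:
        total += delta
        values.append(total)
    return values


# Hypothetical sorted column: the deltas are much smaller than the values
time_ids = [1000, 1001, 1001, 1003, 1007]
encoded = delta_encode(time_ids)
```

On a sorted column the deltas stay small and can be stored in fewer bits; on an unsorted column they can be as large as the values themselves (and negative), which is why the benefit disappears there.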
Run Length Encoding (RLE)
Sequences of the same data value within a file are replaced by a count and a single value. RLE compression works best for sorted sequences with long runs. RLE is more beneficial for VP [3].
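A minimal Python sketch of run-length encoding follows, with hypothetical function names and data. It also shows why sorting matters: the same values produce far fewer runs once they are sorted.

```python
def rle_encode(values):
    """Collapse each run of equal values into a [value, count] pair."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


def rle_decode(runs):
    """Expand [value, count] pairs back into the original sequence."""
    values = []
    for value, count in runs:
        values.extend([value] * count)
    return values


unsorted_column = [3, 1, 3, 1, 3, 1]      # hypothetical column, no long runs
sorted_column = sorted(unsorted_column)   # same values, two long runs
runs_unsorted = rle_encode(unsorted_column)
runs_sorted = rle_encode(sorted_column)
```

The unsorted column yields six runs of length one, so RLE makes it larger, while the sorted column collapses into two runs.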
C. Query Parameters and Table Generation
To study the effect of queries with table characteristics, queries were tested with a varying number of predicates and a varying selectivity factor. Factors affecting the execution plan and cost are (a) schema definition, (b) selectivity factor, (c) number of columns referenced and (d) number of predicates. The execution time of a query changes with column characteristics and I/O bandwidth. For each column characteristic, the query generator randomly selects the columns used to produce a set of "equivalent" queries with the cost analysis [12].
The performance measurement with compression is implemented by:
- Generating an uncompressed HP version of each table with the primary key on the leftmost column.
- Sorting on the columns frequently used in queries.
- Generating a replica on VP.
IV. IMPLEMENTATION DETAIL
To study the effect of VP and HP, the experiments were run against the TPC-H standard Star-Schema on MonetDB. We mainly concentrated on the fact table, Sales, which contains approximately 10 lakh (1,000,000) records. We focused on five columns for selectivity, i.e., prod_id, cust_id, time_id, channel_id and promo_id, with selectivity varying from 0.1% to 50%.

    SELECT p.product_name, ch.channel_class, c.cust_city,
           t.calendar_quarter_desc, SUM(s.amount_sold) sales_amount
    FROM sales s, times t, customers c, channels ch,
         products p, promotions pr
    WHERE s.time_id = t.time_id
      AND s.prod_id = p.prod_id
      AND s.cust_id = c.cust_id
      AND s.channel_id = ch.channel_id
      AND s.promo_id = pr.promo_id
      AND c.cust_state_province = 'CA'
      AND ch.channel_desc IN ('Internet', 'Catalog')
      AND t.calendar_quarter_desc IN ('1999-Q1', '1999-Q2')
    GROUP BY ch.channel_class, p.product_name,
             c.cust_city, t.calendar_quarter_desc;

Table 1: Generalized Star-Schema Query
A. Read-Optimized Blocks (Pages)
Both HP and VP densely pack the table on the blocks to reduce I/O bandwidth. With varying page size, HP keeps tuples together, while VP stores each column in a different file. The entries on a page are not aligned to byte or word boundaries, in order to achieve better compression. Each page begins with a page header containing the number of entries on the page, followed by the data and the compression dictionary. The size of the compression dictionary is stored at the very end of the page, with the dictionary growing backwards from the end of the page towards the front. For HP, the dictionaries for the dictionary-compressed columns are stored sequentially at the end of the page.
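The page layout described above can be sketched in Python as follows. This is a simplified illustration, not the layout used in the experiments: the page size, the field widths and the function names are assumptions, and unlike the pages in the paper the sketch keeps every entry byte-aligned to stay short.

```python
import struct

PAGE_SIZE = 64      # hypothetical page size, tiny for illustration
HEADER_BYTES = 2    # entry count, stored as an unsigned 16-bit integer
TRAILER_BYTES = 2   # dictionary size, stored in the last two bytes of the page
ENTRY_BYTES = 4     # each dictionary entry is an unsigned 32-bit integer


def build_page(codes, dictionary):
    """Lay out a page as: header | codes | free space | dictionary | dictionary size."""
    dict_start = PAGE_SIZE - TRAILER_BYTES - ENTRY_BYTES * len(dictionary)
    if HEADER_BYTES + len(codes) > dict_start:
        raise ValueError("data and dictionary do not fit on one page")
    page = bytearray(PAGE_SIZE)
    struct.pack_into("<H", page, 0, len(codes))                  # page header
    page[HEADER_BYTES:HEADER_BYTES + len(codes)] = bytes(codes)  # one byte per code
    struct.pack_into("<H", page, PAGE_SIZE - TRAILER_BYTES, len(dictionary))
    pos = PAGE_SIZE - TRAILER_BYTES
    for value in dictionary:                                     # grows towards the front
        pos -= ENTRY_BYTES
        struct.pack_into("<I", page, pos, value)
    return bytes(page)


def read_page(page):
    """Decode a page produced by build_page back into the original values."""
    (count,) = struct.unpack_from("<H", page, 0)
    codes = list(page[HEADER_BYTES:HEADER_BYTES + count])
    (dict_size,) = struct.unpack_from("<H", page, len(page) - TRAILER_BYTES)
    pos = len(page) - TRAILER_BYTES
    dictionary = []
    for _ in range(dict_size):
        pos -= ENTRY_BYTES
        dictionary.append(struct.unpack_from("<I", page, pos)[0])
    return [dictionary[code] for code in codes]


page = build_page([0, 1, 0, 2], [100, 200, 300])
values = read_page(page)
```

Because the data grows forward from the header and the dictionary grows backward from the trailer, the free space sits in the middle and neither region needs a size fixed in advance.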
B. Query Engine, Scanners and I/O
The query scanner scans the files differently for HP and VP. Results are materialized after reading the data and applying predicates to it, with fewer passes in HP than in VP, which requires reading a separate file for each column referenced by the query. Predicates are applied on a per-column basis; columns are processed in order of their selectivity, from most selective (the fewest qualifying tuples) to least selective (the most qualifying tuples). Placing the most selective predicate first allows the scanner to read more of the current file before having to switch to another file, since the output buffer fills up more slowly.
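The ordering rule can be illustrated with a short Python sketch over in-memory columns. It is an assumption-laden simplification: the function name scan_columns, the estimated qualifying fractions and the sample data are hypothetical, and file switching and output buffers are not modelled.

```python
def scan_columns(columns, predicates):
    """Apply per-column predicates, most selective first.

    columns: dict mapping column name to a list of values (equal lengths)
    predicates: dict mapping column name to (test function, estimated
                fraction of rows that qualify)
    Returns the processing order and the qualifying row positions.
    """
    # The predicate with the fewest expected qualifying tuples goes first.
    order = sorted(predicates, key=lambda name: predicates[name][1])
    n_rows = len(next(iter(columns.values())))
    candidates = list(range(n_rows))
    for name in order:
        test, _ = predicates[name]
        column = columns[name]
        # Only positions that survived the earlier predicates are examined.
        candidates = [i for i in candidates if test(column[i])]
    return order, candidates


# Hypothetical columns and predicates
columns = {
    "channel_id": [3, 4, 3, 9, 3, 4],
    "promo_id": [999, 33, 999, 999, 33, 999],
}
predicates = {
    "promo_id": (lambda v: v == 999, 0.66),
    "channel_id": (lambda v: v == 9, 0.17),
}
order, qualifying = scan_columns(columns, predicates)
```

With channel_id evaluated first, only one position is left to check against promo_id; the reverse order would carry four candidates into the second column.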
C. Experimental Setup
All experiments were run on a machine running RHEL 5 with a 2.4 GHz Intel processor and 1 GB of RAM. HP and VP are affected by the amount of I/O and processing bandwidth available in the system, for each combination of output selectivity and number of columns accessed.
Effect of selectivity
Selecting fewer tuples with a very selective filter and an index has no effect on I/O performance; the system time remains the same. The HP time remains the same, since HP has to examine each tuple in the relation to evaluate the predicate. For VP, evaluating the predicate requires more time. As selectivity decreases, the performance ratio between VP and HP becomes smaller. However, as selectivity increases towards 100%, each column scan contributes to the CPU cost. VP is faster than HP when more columns are returned with a selectivity factor from 0.1% to 25%. Further, with the same configuration, compressed HP is faster than VP by a factor of about 4 (Figure 1).
Predicate     Selectivity (%)    No. of Rows    HP (time in sec)    VP (time in sec)
Prod_id       Compressed (50)    1000000        3                   14
Cust_id       25                 1000000        45                  10
Time_id       10                 1000000        40                  20
Promo_id      1                  1000000        35                  20
Channel_id    0.1                1000000        30                  30

Figure 1: Time measurement for HP and VP with varying selectivity and compression
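To make the relative performance in Figure 1 easier to read, the following Python sketch computes the ratio of HP time to VP time for each row. The timings are copied from Figure 1; the variable and function names are hypothetical.

```python
# (predicate, selectivity label, HP seconds, VP seconds), copied from Figure 1
measurements = [
    ("prod_id", "compressed (50%)", 3, 14),
    ("cust_id", "25%", 45, 10),
    ("time_id", "10%", 40, 20),
    ("promo_id", "1%", 35, 20),
    ("channel_id", "0.1%", 30, 30),
]


def hp_over_vp(rows):
    """Ratio of HP time to VP time; values above 1 mean VP was faster."""
    return {name: hp / vp for name, _, hp, vp in rows}


ratios = hp_over_vp(measurements)
for name, ratio in ratios.items():
    print(f"{name}: HP/VP = {ratio:.2f}")
```

A ratio above 1 means VP finished sooner and a ratio below 1 means HP finished sooner, as in the compressed prod_id row.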
Effect of compression
For skewed data distributions and large cardinality in HP, run-length and dictionary compression techniques are more beneficial. The size of a VP tuple is approximately the same as the size of an HP tuple. HP compression is a critical component in determining its performance relative to that of VP. Compression is more beneficial for columns having high cardinality. Some VP proponents have argued that, since VP compresses better than HP, storing the data with multiple projections and sort orders is feasible and can provide even better speedups [18].
Effect of Joins
We examined join operations for the query presented in Table 1, with varying predicates over HP and VP, to analyze the interaction of the resultant tuples with the join (e.g., more instruction cache misses due to switching between scanning and reconstructing tuples and performing the join). Compression improves performance by decreasing I/O bandwidth and increasing scan time, as the column selection ratio grows. Unlike compression, the cost of the join operation increases as the list of selected columns grows. HP outperforms VP when the number of accessed columns is larger. The join component of the time is always roughly equivalent between HP and VP (Figure 2). Thus, the paradigm with the smaller scan time will also have the smaller join time, and