
Journal of Discrete Algorithms 32 (2015) 75–86


An efficient Variable-to-Fixed length encoding using multiplexed parse trees ✩

Satoshi Yoshida a,b,∗, Takuya Kida a

a Graduate School of Information Science and Technology, Hokkaido University, Japan
b Knowledge Discovery Research Laboratories, NEC Corporation, Japan

article info

Article history: Available online 13 November 2014

Keywords: Lossless data compression; VF code; Compressed pattern matching

abstract
We discuss an improved method of Variable-to-Fixed length code (VF code) encoding. A VF code is a coding scheme that splits an input text into a consecutive sequence of substrings, and then assigns a fixed length codeword to each substring. Since all the codewords have the same length, they are easily extracted, and thus, VF code is suitable for processing compressed texts directly. Almost Instantaneous VF code (AIVF code), which was proposed by Yamamoto and Yokoo in 2001, achieves a good compression ratio by using a set of parse trees. However, it requires more time and space for both encoding and decoding than does a classical VF code. In this paper, we prove that the set of parse trees of an AIVF code can be multiplexed into a compact single tree and that the original encoding and decoding procedures can be simulated using the compacted tree. We also give the upper and lower bounds of the number of nodes in the multiplexed parse tree, in addition to those of the number of nodes reduced from the original trees. The experimental results showed that the proposed method can encode natural language texts much faster than AIVF coding.
© 2014 Elsevier B.V. All rights reserved.

1. Introduction
1.1. Background
From the viewpoint of speeding up compressed pattern matching, variable-to-fixed length codes (VF codes for short) have been reevaluated [8,6,15,9,18]. A VF code is a coding scheme that parses an input text into a consecutive sequence of substrings with a dictionary tree, called a parse tree, and then assigns a fixed length codeword to each substring. Tunstall code [12], which is a classical and typical VF code, has been proved to be an entropy code for a memoryless information source (see also [11]); its average code length per symbol comes asymptotically close to the entropy of the source as the code length goes to infinity. However, its actual compression ratio is not as good as that of other well-known codes, such as Huffman code.
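In symbols (a restatement we add for clarity, using notation that is introduced in Section 2): if L̄ = Σ_{x∈L(T)} |x| Pr(x) denotes the average block length of the parse tree T, then the rate of Tunstall code is ℓ/L̄ bits per source symbol, and the entropy-coding property above states that ℓ/L̄ approaches the per-symbol entropy of the memoryless source as ℓ → ∞.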
Several improvements on VF codes have been proposed thus far. Almost Instantaneous VF code (AIVF code for short¹), which was proposed by Yamamoto and Yokoo [15], is one of the most attractive codes from a practical viewpoint.

✩ This is an extended version of papers published in DCC 2010 [16] and in ESKM 2012 [17].
∗ Corresponding author.
E-mail addresses: syoshid@ist.hokudai.ac.jp (S. Yoshida), kida@ist.hokudai.ac.jp (T. Kida).
¹ Almost instantaneous (VF) code originally meant that we need to read one symbol ahead for encoding. However, we use the term as a unique name in this paper.

http://dx.doi.org/10.1016/j.jda.2014.10.005


Fig. 1. Multiplexing the parse trees of AIVF code into a single tree.

Its compression ratio is almost comparable to that of Huffman code.² In fact, even if the codeword length is short, AIVF code achieves a better compression ratio than Tunstall code does with considerably longer codewords (see Section 5). However, its encoding and decoding speeds are slower than those of Tunstall code. The reason is that AIVF code must construct k − 1 parse trees, where k is the alphabet size of the text, which usually reaches 70–140 for natural language texts. Thus, a large amount of memory space is required to hold the trees. Therefore, a VF code with a more compact parse tree that can achieve a compression ratio as good as that of AIVF code is desirable.
1.2. Our contribution
In this paper, we propose an efficient algorithm for encoding and decoding AIVF codes, which integrates the multiple parse trees into a compact single tree and simulates the encoding and decoding procedures of the original AIVF code. Our idea originated in the observation that many nodes in the multiple parse trees of an AIVF code are common, and thus, they can be multiplexed (see Fig. 1). We refer to the integrated parse tree as the Virtual Multiple AIVF parse tree (VMA tree for short). We prove that the upper and lower bounds of the number of nodes in the VMA tree are Mk − (1/2)k² + k/(k − 1) and M ln(k + 1) − M/2 + 1, respectively, where M denotes the number of codewords. We also prove that the upper and lower bounds of the number of nodes reduced from the original multiple parse trees are (kM(k − 1/2) + (1/2)(k − 6)(k + 1) − kM ln(k + 1))/(k − 1) and (1/2)k² + (5/2)k − 7, respectively. We show that in fact our technique allows much faster encoding than does the original AIVF coding for natural language texts.
This paper is an extended version of [16] and [17]. We have revised the estimation of the theoretical bounds presented in the previous versions to be more precise, and have newly added two bounds. Moreover, we conducted additional experiments.
1.3. Related works
The family of dense compressors [2,1,4,3,7] is a class of word-based compression methods, which assign a codeword to a word or to consecutive words. A dense compressor sorts all the words that occur in the input text by their occurrence frequencies in descending order, and then encodes the input text by encoding the rank of each word. In particular, v2vdc [3] is a dense compressor that regards frequent consecutive words as a single word and then encodes each word by End-Tagged Dense Code [2]. Dense compressors achieve a high compression ratio on natural language texts and also enable us to conduct pattern matching at high speed. However, they are not effective for texts that cannot be separated into words, such as DNA data.
Kida [6] and Klein and Shapira [8] independently developed a suffix tree-based VF code (STVF and DynC, respectively), which uses a pruned suffix tree as the parse tree. Uemura, Yoshida, and Kida [13] devised an improved STVF code utilizing the instantaneous encoding technique. These methods achieved compression ratios that are good compared with those of Tunstall code, but not as good as those of gzip.
Recently, Yoshida et al. [18] presented a method that improves the compression ratios of VF codes by training the parse tree, and showed that the compression ratios of VF codes can approach those of gzip. However, their algorithm is time-consuming for encoding, because it scans the whole text many times to train the parse tree.
1.4. Organization of this paper
The rest of this paper is organized as follows. In Section 2, we discuss Tunstall code and AIVF code. In Section 3, we introduce our proposed method. We present a theoretical analysis of the number of nodes in the integrated parse tree in Section 4. We present experimental results on the compression and decompression performance of the proposed method in Section 5. Finally, we present our conclusions in Section 6.
2. Preliminaries
In this section, we introduce basic terms and notations. We also present a brief sketch of VF code.

² It should be noted that AIVF code assumes a memoryless information source for its input, as Tunstall and Huffman codes do.


2.1. Terminology and notation


Let Σ be a finite alphabet and Σ∗ be the set of all strings over Σ. The length of a string x is denoted by |x|. The string whose length is 0 is called the empty string and is denoted by ε. Therefore, we have |ε| = 0. The concatenation of two strings, x₁ and x₂, is denoted by x₁ · x₂, and is also written simply as x₁x₂ if no confusion occurs.
The occurrence probability of string x in a text S is denoted by Pr_S(x). We define Pr_S(a) as (the number of times that symbol a occurs in text S)/(the length of text S) for a ∈ Σ. We also define Pr_S(ε) = 1 for convenience. Although Pr_S(x) depends on S, we write it simply as Pr(x) when the target text is obvious from the context or when we treat it as the statistical feature of a given information source.
A tree in which each node has at most k children is called a k-ary tree. A node that has some children is called an internal node (or an inner node), and a node that has no children is called a leaf node (or a leaf). The node that has no parent, i.e., the top of a tree, is called the root node (or the root). Furthermore, in a k-ary tree, a node that has exactly k children is called a complete internal node, and an internal node that is not complete is called an incomplete internal node. A tree in which all internal nodes are complete is called a complete k-ary tree.
For a tree (or a forest) T, the set of all leaves, the set of all incomplete internal nodes, and the set of all complete internal nodes in T are denoted by L(T), I(T), and C(T), respectively. The union of L(T) and I(T) is denoted by N(T), and the size of a set S by #S. Then, for example, the number of leaves can be denoted by #L(T). For a node n, the number of children of n is called the degree of n, which is denoted by d(n).
2.2. Variable-to-fixed length codes (VF codes)
In this section, we discuss Tunstall code and AIVF code. A VF code uses a parse tree when it encodes and decodes the input text. A VF code parses the input text into variable length strings, called blocks, using the parse tree, and assigns a fixed length codeword to each of them. Hereafter, a node in the parse tree is identified with its corresponding block.
2.2.1. Tunstall code
For a given text whose alphabet size is k, Tunstall code uses a complete k-ary tree as a parse tree T, called a Tunstall tree. Each edge in the tree is labeled with a symbol in Σ. Each node corresponds to the string over Σ that is spelled out on the path from the root to the node. A codeword, which is a binary string of length ℓ = ⌈lg #L(T)⌉, is assigned to each leaf in the Tunstall tree; namely, there is a one-to-one correspondence between codewords and leaves. A given text is parsed into a consecutive sequence of blocks by the tree, and each block is encoded with the corresponding binary codeword. Decoding of Tunstall code is performed as follows: (i) read codewords one by one; (ii) find the node corresponding to the codeword; and (iii) output the block corresponding to the node.
Next, we consider the algorithm that constructs the parse tree that maximizes the average block length. It is assumed that the information source is memoryless. Then, we have Pr(xa) = Pr(x) · Pr(a) for x ∈ Σ⁺ and a ∈ Σ. The algorithm that constructs the parse tree with at most M codewords is: (i) create the root node; and (ii) repeatedly select the leaf that has the highest probability and make it complete by adding k children to it, while the number of leaves does not exceed M. The optimality of the parse tree created here is proved in [14].
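As an illustration of this greedy construction, the following C++ sketch builds a Tunstall tree from given symbol probabilities. It is a minimal sketch under the memoryless assumption above; the Node representation and function names are our own simplifications, not the authors' implementation.

    #include <memory>
    #include <queue>
    #include <vector>

    // A node of the parse tree; prob is Pr(x) of the block x spelled out
    // on the path from the root to this node.
    struct Node {
        double prob = 0.0;
        std::vector<std::unique_ptr<Node>> children;
    };

    // Build a complete k-ary Tunstall tree with at most maxCodewords leaves.
    // symProb[j] = Pr(a_{j+1}); by the memoryless assumption,
    // Pr(xa) = Pr(x) * Pr(a).
    std::unique_ptr<Node> buildTunstallTree(const std::vector<double>& symProb,
                                            std::size_t maxCodewords) {
        const std::size_t k = symProb.size();
        auto root = std::make_unique<Node>();
        root->prob = 1.0;

        // Max-heap of current leaves, ordered by block probability.
        auto cmp = [](const Node* a, const Node* b) { return a->prob < b->prob; };
        std::priority_queue<Node*, std::vector<Node*>, decltype(cmp)> leaves(cmp);
        leaves.push(root.get());

        std::size_t numLeaves = 1;
        // Expanding a leaf removes one leaf and adds k, i.e., adds k - 1 leaves.
        while (numLeaves + k - 1 <= maxCodewords) {
            Node* n = leaves.top();  // leaf with the highest probability
            leaves.pop();
            for (double p : symProb) {
                auto child = std::make_unique<Node>();
                child->prob = n->prob * p;
                leaves.push(child.get());
                n->children.push_back(std::move(child));
            }
            numLeaves += k - 1;
        }
        return root;  // fixed length codewords are then assigned to the leaves
    }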
2.2.2. Almost instantaneous VF code
We now give a brief survey of AIVF code [15], which is based on Tunstall code. In order to improve the compression ratio, Yamamoto and Yokoo employed two techniques: one is to assign codewords to the incomplete internal nodes in the parse tree, and the other is to use multiple parse trees. Let Σ = {a₁, ..., a_k} and assume, for convenience of discussion, that all symbols in Σ are sorted in descending order of their occurrence probabilities. That is, i < j implies Pr(a_i) ≥ Pr(a_j). It is also assumed that all the codes discussed below are binary codes.
Improvement by assigning codewords to incomplete internal nodes In a Tunstall tree, unused codewords of length ℓ exist if #L(T) ≠ 2^ℓ. This suggests that the average block length can be increased by assigning these unused codewords to some strings. If a leaf is added to a complete k-ary tree, an incomplete internal node is formed. This also suggests that the average block length can be increased further if low-frequency leaves are removed and more useful edges are extended instead. Fig. 2 is an example of the parse tree for an AIVF code, where Σ = {a, b, c} (i.e., k = 3), the occurrence probabilities are 0.6, 0.3, and 0.1, respectively, and the codeword length ℓ is equal to 3. All leaves and incomplete internal nodes have their own codewords; that is, every x ∈ N(T) is assigned a codeword.
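To see concretely why Tunstall code must waste codewords here, recall (an elementary fact we add for illustration) that a complete k-ary tree in which i internal nodes have been expanded has exactly i(k − 1) + 1 leaves. For k = 3 the possible leaf counts are 1, 3, 5, 7, 9, ..., all odd, so with ℓ = 3 and 2³ = 8 codewords a Tunstall tree can use at most 7 leaves, and one codeword is necessarily unused; the AIVF tree of Fig. 2 recovers it by also assigning codewords to incomplete internal nodes.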
Improvement by using multiple parse trees In AIVF code, blocks are not statistically independent, even if the information source is memoryless. Parsing with a parse tree in which incomplete internal nodes have codewords creates contexts between blocks. Yamamoto and Yokoo also showed that the average block length can be increased by using a set of parse trees in order to capture these contexts. For example, assume that block aaa has just been parsed and 001 is output, while encoding is performed using the parse tree shown in Fig. 2. In this case, the next symbol is b or c, because the traverse would have continued if the next symbol were a, and then block aaaa would have been parsed and 000 output. Therefore, when 001 is output, the nodes corresponding to the codes 000, 001, 010, 011, and 100 are unreachable in the next traverse.


Fig. 2. Parse tree T_0 for AIVF code. A circle indicates an internal node and a square indicates a leaf.

Fig. 3. Another parse tree T_1 for AIVF code.

This suggests that the average block length can be increased by assigning these unreachable codewords to other strings. For the running example, instead of using the tree in Fig. 2, the length of the next block can be increased by using the tree in Fig. 3.
In the method proposed in [15], k − 1 parse trees T_i (i = 0, ..., k − 2) are utilized. For each i, the i-th parse tree T_i has a root whose children are labeled by a_{i+1}, ..., a_k (recall that Pr(a_i) ≥ Pr(a_j) for i < j). A given text is encoded and decoded by switching the parse trees according to the context.
We now discuss how to construct the optimal parse tree for each T_i. Let M be the number of codewords. Then, our aim is to construct the optimal parse tree that maximizes the average block length Σ_{x∈N(T)} |x| Pr(x) for a memoryless information source. The algorithm is shown in Algorithm 1. It should be noted that, when the (k − 1)-th child of an incomplete node is added, the average block length can be increased by also adding the k-th child without increasing the number of codewords, because a codeword is not assigned to a complete internal node. Recall that it is assumed that Pr(a_i) ≥ Pr(a_j) for i < j. In this case, a greedy algorithm yields the optimal solution. The algorithm constructs T_i as follows. First, it creates the root node and its k − i children as the initial tree. It should be noted that we treat the root as a complete internal node. Next, the following is repeated while the number of unused codewords is at least k − d(n) − 1, where n is the incomplete internal node or leaf with the maximum probability: (i) let S1 be the average block length if we create all the children of n, and let S2 be the average block length if we instead create the k − d(n) − 1 leaves with the maximum probabilities sequentially; (ii) if S1 ≥ S2, we create all the children of n; otherwise, we create the k − d(n) − 1 leaves with the maximum probabilities sequentially. Finally, we create M − m leaves with the maximum probabilities sequentially, where m is the number of codewords used.
The encoding algorithm with multiple parse trees is Algorithm 2. First, T_0 is selected as the current parse tree. Then, the symbols are read one by one, and the parse tree is traversed by each symbol. If the child labeled by the symbol does not exist, the codeword assigned to the current node is output. Next, it must be determined which tree is used for the next traverse. Let n be the node that the preceding traverse finally reached, and let n have d(n) children labeled by a₁, a₂, ..., a_{d(n)}. Then, the next symbol is larger than a_{d(n)}. Thus, the root of the next tree should have k − d(n) children labeled by a_{d(n)+1}, a_{d(n)+2}, ..., a_k. The root of the tree T_i of the multiple parse trees has k − i children labeled by a_{i+1}, a_{i+2}, ..., a_k. Hence, T_{d(n)} is selected as the next parse tree and we jump to the root of T_{d(n)}. For example, let an input text be S = aabaabaccab, and consider encoding S with the parse trees in Figs. 2 and 3. Then, S is split into the blocks aa, baa, bac, ca, and b. Therefore, the output binary sequence of length 15 bits is obtained as 001 001 011 111 101. For the decoding, the same trees must be used as for the encoding.
3. Virtual multiple AIVF tree
In this section, we explain our idea and present an efficient algorithm for AIVF code. In the parse trees of an AIVF code, it can be observed that many nodes in T_i are identical to those in T_{i+1}.


Algorithm 1 Constructing multiple parse trees.
1: for i = 0, 1, ..., k − 2 do
2:   Create the initial tree T_i, which consists of the root node and its k − i children.
3:   m ← k − i. {m is the number of nodes that have codewords.}
4:   n ← argmax_{n∈N(T_i)} Pr(n).
5:   while k − d(n) − 1 ≤ M − m do
6:     S1 ← the average block length assuming that we call Complete().
7:     S2 ← the average block length assuming that we call FindOptPos() k − d(n) − 1 times.
8:     if S1 ≥ S2 then
9:       Call Complete().
10:    else
11:      Call FindOptPos() k − d(n) − 1 times.
12:    end if
13:    m ← m + k − d(n) − 1.
14:    n ← argmax_{n∈N(T_i)} Pr(n).
15:  end while
16:  Call FindOptPos() M − m times.
17: end for
Procedure Complete()
18: n ← argmax_{n∈N(T_i)} Pr(n).
19: Add k − d(n) children n_j (j = d(n) + 1, ..., k) to n.
Procedure FindOptPos()
20: n ← argmax_{n∈N(T_i)} Pr(n) · Pr(a_{d(n)+1}).
21: Add the (d(n) + 1)-th child n′ to n.

Algorithm 2 Encoding with multiple parse trees.
1: Construct the multiple parse trees T_0, T_1, ..., T_{k−2}.
2: T ← T_0.
3: n ← the root of T_0.
4: while not the end of the input text do
5:   c ← the next symbol of the input text.
6:   if there is no child of n to which we can traverse by symbol c then
7:     Output the codeword of n.
8:     T ← T_{d(n)}.
9:     n ← the root of T.
10:  else
11:    n ← the child of n labeled by c.
12:  end if
13: end while

More precisely, T_{i+1} completely covers the nodes in T_i, except for the leftmost³ subtree under the node corresponding to a_{i+1}. First, we explain this relationship. Let S_j^{(i)} be the subtree of T_i that consists of all the nodes under the direct child of the root corresponding to a_j. Then, we have the following theorem.

Theorem 1. Subtree S_{i+j}^{(i+1)} completely covers S_{i+j}^{(i)} for any integers i (0 ≤ i ≤ k − 3) and j (2 ≤ j ≤ k − i).

Proof. We prove the theorem by contradiction. For an integer i (0 ≤ i ≤ k − 2), let T̂_i be the tree of infinite depth in which the root has children according to a_{i+1}, ..., a_k, and all the other internal nodes have exactly k children. It should be noted that the root of the multiple parse tree T_i has children according to a_j (j = i + 1, ..., k); that is, T̂_i includes S_j^{(i)} (j = i + 1, ..., k). It should also be noted that each T_i is optimal in the sense that it maximizes the average block length. That is, for any i (0 ≤ i ≤ k − 3) and j (2 ≤ j ≤ k − i), the average block length cannot be further increased by exchanging any node in S_{i+j}^{(i)} for a node included in T̂_i but not in T_i. Now, assume that S_{i+j}^{(i+1)} does not cover S_{i+j}^{(i)} completely. Then, there exists a node that is in S_{i+j}^{(i)} but not in S_{i+j}^{(i+1)}; let n be such a node. Since T_i is a parse tree that maximizes the average block length, the average block length of T_{i+1} could be increased by exchanging some node that is in S_{i+j}^{(i+1)} but not in S_{i+j}^{(i)} for n. However, this contradicts the fact that T_{i+1} maximizes the average block length. Therefore, S_{i+j}^{(i+1)} completely covers S_{i+j}^{(i)}. □

The set of trees can thus be multiplexed and integrated into a single tree according to the above theorem. The integrated tree is called a Virtual Multiple AIVF tree (VMA tree for short). To simulate the encoding and decoding of the original AIVF code by using the VMA tree, it is necessary to indicate which parse tree is currently being traversed during processing. Thus, each node in the VMA tree is marked to indicate to which trees the node belongs. Since a node can belong to several parse trees, the least i such that n belongs to T_i is saved for each node n.

³ Each edge is arranged in descending order of the occurrence probability of its label, in a left-to-right manner.


Fig. 4. An example of a VMA tree. This is an integrated tree obtained by multiplexing the trees shown in Fig. 2 and in Fig. 3. The number written in each
node indicates the value of Tn(n).

Algorithm 3 Encoding algorithm with a VMA tree T.
1: n ← the root of T.
2: i ← 0.
3: while not the end of the input text do
4:   c ← the next symbol of the input text.
5:   if there exists a child of n labeled by c then
6:     n′ ← the child of n labeled by c.
7:     if Tn(n′) ≤ i then
8:       n ← n′.
9:     else
10:      Goto line 13.
11:    end if
12:  else
13:    Output w_i(n).
14:    i ← d(n).
15:    n ← the child of the root labeled by c.
16:  end if
17: end while

Denoting this mark by Tn(n), we have Tn(n) = min{i | 0 ≤ i ≤ k − 2, n belongs to T_i}. For example, by integrating the two parse trees shown in Fig. 2 and Fig. 3, we obtain the parse tree shown in Fig. 4.
To encode with a VMA tree, the previous encoding algorithm must be modified, because even if there is a child in the VMA tree for the next traverse, the traverse fails when there is no corresponding child in T_i. Therefore, Tn(n) is compared with the number i of the currently traversed tree T_i. If i is less than Tn(n), there is no proper node in T_i, and we return to the root. The encoding algorithm is shown in Algorithm 3, where the codeword assigned to n in T_i is denoted by w_i(n).
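For concreteness, the following C++ sketch mirrors the control flow of Algorithm 3. The flat array-of-nodes layout, the per-symbol child table, and the outputCodeword callback standing for w_i(n) are assumptions of ours for illustration, not the authors' data structures.

    #include <functional>
    #include <string>
    #include <vector>

    // A VMA-tree node: children[c] is the index of the child reached by
    // the symbol with byte value c (-1 if absent); tn stores the mark Tn(n).
    struct VmaNode {
        std::vector<int> children;  // size 256, -1 where no child exists
        int tn = 0;                 // least i such that the node belongs to T_i
    };

    // Degree of node n within T_i: only children marked Tn <= i exist there.
    int degreeIn(const std::vector<VmaNode>& tree, int n, int i) {
        int d = 0;
        for (int ch : tree[n].children)
            if (ch != -1 && tree[ch].tn <= i) ++d;
        return d;
    }

    // Encode text with the VMA tree rooted at root; outputCodeword(i, n)
    // stands for emitting w_i(n), the codeword assigned to node n in T_i.
    void encodeVMA(const std::vector<VmaNode>& tree, int root,
                   const std::string& text,
                   const std::function<void(int, int)>& outputCodeword) {
        int n = root;
        int i = 0;  // index of the parse tree T_i currently being simulated
        for (unsigned char c : text) {
            int child = tree[n].children[c];
            if (child != -1 && tree[child].tn <= i) {
                n = child;  // the child also belongs to T_i: keep traversing
            } else {
                // No usable child in T_i: emit w_i(n) and switch to T_{d(n)}.
                int d = degreeIn(tree, n, i);  // d(n) within T_i
                outputCodeword(i, n);
                i = d;
                // The next block starts with c; the root child labeled c
                // exists in T_{d(n)} because the next symbol exceeds a_{d(n)}.
                n = tree[root].children[c];
            }
        }
        if (n != root) outputCodeword(i, n);  // flush the last block
    }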
The algorithm for constructing a VMA tree is shown in Algorithm 4. Let S_i denote the subtree that consists of all nodes under the child of the root corresponding to a_i. We define #N(S_0) = M − k for convenience. We define the function cod(a_j) = j, and denote by first(n) the first symbol of the label sequence on the path from the root to n; that is, if first(n) = a_j, then cod(first(n)) = j. The number of nodes having codewords is denoted by m in Algorithm 4. By Theorem 1, the subtrees S_{i+1}^{(i)} for i = 0, ..., k − 2 and S_k^{(k−2)} are in the VMA tree. Therefore, the algorithm repeatedly constructs a parse tree T in the manner of Algorithm 1 and moves the leftmost subtree from T to T_V. It should be noted that the number of nodes with codewords to be created in the next iteration is the same as the number of nodes that have codewords in the subtree moved to T_V.
4. Bound analysis of VMA tree
In this section, we discuss the upper and lower bounds of the number of nodes in a VMA tree. Let M be the number of codewords, i.e., M = 2^ℓ for the codeword length ℓ. We prove that the upper and lower bounds of the total number of nodes in a VMA tree are, respectively, Mk − (1/2)k² + k/(k − 1) in Theorem 2 and M ln(k + 1) − M/2 + 1 in Theorem 3, and that the upper and lower bounds of the number of nodes reduced by integrating the multiple parse trees into one are, respectively, (kM(k − 1/2) + (1/2)(k − 6)(k + 1) − kM ln(k + 1))/(k − 1) in Theorem 4 and (1/2)k² + (5/2)k − 7 in Theorem 5.

Theorem 2. An upper bound of the total number of nodes in the integrated parse tree of an AIVF code is Mk − (1/2)k² + k/(k − 1), where M and k are the number of codewords and the alphabet size, respectively.

Proof. The VMA tree can be viewed as a joint tree that consists of several subtrees of the AIVF trees: the leftmost subtrees S_{i+1}^{(i)} for i = 0, ..., k − 3 and the subtrees S_{k−1}^{(k−2)} and S_k^{(k−2)} of tree T_{k−2}. Subtree S_{i+1}^{(i)} has at most M − (k − i − 1) codewords, because each of the other subtrees has at least one codeword. A k-ary tree with L leaves has at most (L − 1)/(k − 1) complete internal nodes.


Algorithm 4 Constructing a VMA tree T_V.
1: T ← the tree that consists of the root and its k children; T_V ← the tree that consists of only the root.
2: Tn(n) ← 0 for all nodes n in T.
3: for l = 0 to k − 2 do
4:   m ← #N(S_l); n ← argmax_{n∈N(T)} Pr(n).
5:   while k − d(n) − 1 ≤ m do
6:     S1 ← the average block length assuming that we call Complete().
7:     S2 ← the average block length assuming that we call FindOptPos() k − d(n) − 1 times.
8:     if S1 ≥ S2 then
9:       Call Complete().
10:    else
11:      Call FindOptPos() k − d(n) − 1 times.
12:    end if
13:    n ← argmax_{n∈N(T)} Pr(n).
14:    m ← m − k + d(n) + 1.
15:  end while
16:  Call FindOptPos() m times.
17:  S ← the subtree corresponding to the node traversed from the root by a_{l+1}.
18:  R ← T except for S; Add S to T_V; T ← R.
19: end for
20: Add T's subtrees to T_V.
Procedure Complete()
21: n ← argmax_{n∈N(T)} Pr(n).
22: Add k − d(n) children n_j (j = d(n) + 1, ..., k) to n.
23: Tn(n_j) ← cod(first(n_j)) − 1.
Procedure FindOptPos()
24: n ← argmax_{n∈N(T)} Pr(n) · Pr(a_{d(n)+1}).
25: Add the (d(n) + 1)-th child n′ to n.
26: Tn(n′) ← cod(first(n′)) − 1.

Therefore, the upper bound of the number of nodes in the VMA tree is

Σ_{i=0}^{k−3} [#C(S_{i+1}^{(i)}) + #N(S_{i+1}^{(i)})] + #C(S_{k−1}^{(k−2)}) + #C(S_k^{(k−2)}) + #N(S_{k−1}^{(k−2)}) + #N(S_k^{(k−2)}) + 1
    ≤ Σ_{i=0}^{k−3} [M − (k − i − 1) + (M − (k − i − 1) − 1)/(k − 1)] + M + (M − 1)/(k − 1) + 1
    = Mk − (1/2)k² + k/(k − 1).  □

Theorem 3. A lower bound of the total number of nodes in the integrated parse tree of an AIVF code is M ln(k + 1) − M/2 + 1, where M and k are the number of codewords and the alphabet size, respectively.

Proof. Since the symbols a₁, ..., a_k are sorted in descending order of their occurrence probabilities, we have #N(S_j^{(i)}) ≥ #N(S_{j+1}^{(i)}). Therefore, we have #N(S_{i+1}^{(i)}) ≥ M/(k − i). Hence, the lower bound of the number of nodes in the VMA tree is

Σ_{i=0}^{k−3} #N(S_{i+1}^{(i)}) + #N(S_{k−1}^{(k−2)}) + #N(S_k^{(k−2)}) + 1
    ≥ Σ_{i=0}^{k−3} M/(k − i) + M + 1
    ≥ M ∫_1^{k+1} (1/i) di − M/2 + 1
    = M ln(k + 1) − M/2 + 1.  □
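As a quick sanity check (our own arithmetic, added for illustration), instantiating the two bounds for the dna setting of Table 6 with k = 16 and codeword length ℓ = 8, i.e., M = 2⁸ = 256, gives

    Mk − (1/2)k² + k/(k − 1) = 4096 − 128 + 16/15 ≈ 3969,
    M ln(k + 1) − M/2 + 1 = 256 ln 17 − 127 ≈ 598,

which match the first row of Table 6.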

Theorem 4. An upper bound of the number of nodes reduced by the integration is (kM(k − 1/2) + (1/2)(k − 6)(k + 1) − kM ln(k + 1))/(k − 1), where M and k are the number of codewords and the alphabet size, respectively.


Table 1
Experimental text files.

Texts         | Size (byte) | |Σ| | 0-th order entropy | Content
english.300MB | 300 000 000 | 225 | 0.5655             | English document
dna           | 403 927 746 | 16  | 0.2478             | DNA sequence

Proof. The subtrees S_2^{(0)}, ..., S_k^{(0)}, S_3^{(1)}, ..., S_k^{(1)}, ..., S_{k−1}^{(k−3)}, S_k^{(k−3)}, and k − 2 root nodes are reduced by the integration. Analogously to the proof of Theorem 3, the upper bound is

Σ_{i=0}^{k−3} Σ_{j=i+2}^{k} [#N(S_j^{(i)}) + #C(S_j^{(i)})] + k − 2
    ≤ Σ_{i=0}^{k−3} [(k − i − 1)·M/(k − i) + ((k − i − 1)·M/(k − i) − 1)/(k − 1)] + k − 2
    ≤ (kM(k − 1/2) + (1/2)(k − 6)(k + 1) − kM ln(k + 1))/(k − 1).  □

Theorem 5. A lower bound of the number of nodes reduced by the integration is (1/2)k² + (5/2)k − 7, where k is the alphabet size.

Proof. Analogously to the proofs of Theorem 2 and Theorem 4, the lower bound is

Σ_{i=0}^{k−3} [Σ_{j=i+2}^{k} #N(S_j^{(i)}) + 1]
    ≥ Σ_{i=0}^{k−3} (k − i + 2)
    = (1/2)k² + (5/2)k − 7.  □
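Similarly (again our own arithmetic, added for illustration), for the dna setting with codeword length ℓ = 12, i.e., k = 16 and M = 4096, Theorems 4 and 5 give

    (kM(k − 1/2) + (1/2)(k − 6)(k + 1) − kM ln(k + 1))/(k − 1)
        = (1 015 808 + 85 − 65 536 ln 17)/15 ≈ 55 348,
    (1/2)k² + (5/2)k − 7 = 128 + 40 − 7 = 161,

in agreement with the corresponding row of Table 8.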

5. Experiments
We implemented Tunstall code, AIVF code, and our proposed method.⁴ We abbreviate these programs as Tunstall, AIVF, and VMA, respectively. All the programs are written in C++ and compiled by GNU g++ version 4.6. We embedded the information of the tree structures into the compressed texts by balanced parentheses [10]. We ran our experiments on a workstation equipped with an Intel Xeon(R) 3.00 GHz CPU and 12 GB RAM, running Ubuntu 12.04. We used dna and english.300MB, the latter being the first 300 MB of english from the Pizza&Chili Corpus.⁵ For details, please refer to Table 1.
First, we measured the compression ratios, compression times, and decompression times of Tunstall, AIVF, and VMA. We measured 100 × (compressed file size)/(original file size) as the compression ratio. Table 2 shows the compression ratios. The compression ratios of AIVF and VMA are better than that of Tunstall on both english.300MB and dna. Since AIVF and VMA essentially perform the same compression, their compression ratios are almost the same; the slight differences are caused by the differences in the size of the parse trees. The compression times are shown in Table 3. The compression of VMA and AIVF is slower than that of Tunstall, because VMA and AIVF create more nodes than Tunstall does. In addition, the compression of VMA is faster than that of AIVF when the codeword length is long, because VMA creates fewer nodes than AIVF does. The improvement in compression time is larger on english.300MB than on dna: AIVF constructs k − 1 parse trees, where k is the alphabet size, so VMA eliminates many nodes when a corpus with a large alphabet is compressed. Although the number of nodes in the parse tree grows exponentially as the codeword length increases, the compression time of Tunstall does not seem to slow down. The reason is considered to be that the total amount of I/O time is reduced because the total amount of data decreases as the codeword length increases. Next, the decompression times are shown in Table 4. The decompression of VMA and AIVF is slower than that of Tunstall. Since the number of nodes in the parse trees of VMA and AIVF is larger than that of Tunstall, the reconstruction of the parse trees of VMA and AIVF is slower than that of Tunstall. Moreover, the decompression of VMA is slightly slower than that of AIVF, because VMA needs extra operations to compute the degree of a node, by comparing the minimum tree numbers of its children with the current tree number, while decoding.

⁴ Source codes are available from http://www-ikn.ist.hokudai.ac.jp/~kida/download/VMA.tar.gz.
⁵ http://pizzachili.dcc.uchile.cl/index.html.


Table 2
Compression ratios of Tunstall, AIVF, and VMA in percentage.

Codeword length | english.300MB: Tunstall | AIVF  | VMA   | dna: Tunstall | AIVF  | VMA
8               | 100.00                  | 79.52 | 79.52 | 35.22         | 26.29 | 26.29
9               | 94.19                   | 63.09 | 63.08 | 34.23         | 26.11 | 26.11
10              | 93.18                   | 59.98 | 59.96 | 32.74         | 25.91 | 25.91
11              | 83.81                   | 59.06 | 59.03 | 32.10         | 25.75 | 25.75
12              | 81.24                   | 58.99 | 58.95 | 31.09         | 25.70 | 25.71
13              | 78.55                   | 58.64 | 58.55 | 30.54         | 25.64 | 25.65
14              | 76.76                   | 58.56 | 58.38 | 30.09         | 25.57 | 25.59
15              | 74.37                   | 58.72 | 58.36 | 29.65         | 25.48 | 25.53
16              | 73.82                   | 59.20 | 58.49 | 29.26         | 25.52 | 25.61

Table 3
Compression times of Tunstall, AIVF, and VMA in seconds.

Codeword length | english.300MB: Tunstall | AIVF    | VMA    | dna: Tunstall | AIVF   | VMA
8               | 19.54                   | 20.26   | 28.90  | 18.54         | 18.45  | 20.31
9               | 20.49                   | 20.23   | 28.26  | 17.21         | 18.36  | 20.42
10              | 19.65                   | 22.35   | 29.34  | 16.62         | 18.84  | 21.54
11              | 19.63                   | 26.31   | 32.33  | 17.19         | 19.22  | 21.01
12              | 17.29                   | 34.44   | 33.17  | 17.34         | 20.06  | 21.37
13              | 18.50                   | 63.16   | 39.61  | 17.21         | 22.31  | 23.27
14              | 19.03                   | 160.71  | 60.06  | 16.83         | 32.48  | 30.41
15              | 18.19                   | 642.42  | 143.83 | 16.52         | 96.45  | 81.04
16              | 17.69                   | 5241.48 | 838.33 | 16.05         | 774.81 | 483.77

Table 4
Decompression times of Tunstall, AIVF, and VMA in seconds.

Codeword length | english.300MB: Tunstall | AIVF  | VMA   | dna: Tunstall | AIVF  | VMA
8               | 17.15                   | 15.61 | 18.42 | 8.44          | 14.40 | 15.15
9               | 15.40                   | 14.54 | 17.29 | 7.77          | 14.35 | 15.11
10              | 14.36                   | 14.65 | 17.41 | 6.90          | 14.14 | 14.88
11              | 12.71                   | 15.14 | 17.62 | 6.89          | 14.59 | 15.05
12              | 11.37                   | 15.82 | 17.94 | 6.20          | 14.30 | 14.93
13              | 10.90                   | 18.52 | 19.08 | 6.04          | 14.37 | 15.72
14              | 9.70                    | 19.53 | 20.72 | 5.51          | 15.16 | 15.96
15              | 9.14                    | 23.40 | 27.77 | 5.57          | 16.89 | 20.36
16              | 11.78                   | 27.49 | 35.15 | 5.86          | 20.21 | 25.11

Table 5
The numbers of nodes of AIVF and VMA, and the upper and lower bounds on the VMA tree, for english.300MB.

Codeword length | AIVF       | VMA       | Upper bound (Theorem 2) | Lower bound (Theorem 3)
8               | 57 568     | 3849      | 32 289                  | 1261
9               | 114 912    | 8210      | 89 889                  | 2520
10              | 229 600    | 18 239    | 205 089                 | 5040
11              | 458 976    | 38 420    | 435 489                 | 10 078
12              | 917 728    | 77 470    | 896 289                 | 20 156
13              | 1 835 232  | 152 665   | 1 817 889               | 40 310
14              | 3 670 240  | 305 362   | 3 661 089               | 80 619
15              | 7 340 256  | 614 054   | 7 347 489               | 161 237
16              | 14 680 288 | 1 238 041 | 14 720 289              | 322 473

The numbers of nodes of AIVF and VMA, together with the upper and lower bounds on the VMA tree, are shown in Tables 5 and 6. VMA considerably reduces the number of nodes, especially on the English text. Finally, the numbers of nodes reduced by the integration, together with their upper and lower bounds, are shown in Tables 7 and 8.


Table 6
The numbers of nodes of AIVF and VMA, and the upper and lower bounds on the VMA tree, for dna.

Codeword length | AIVF    | VMA     | Upper bound (Theorem 2) | Lower bound (Theorem 3)
8               | 3855    | 1913    | 3969                    | 598
9               | 7695    | 3949    | 8065                    | 1196
10              | 15 375  | 7662    | 16 257                  | 2390
11              | 30 735  | 15 851  | 32 641                  | 4779
12              | 61 455  | 30 605  | 65 409                  | 9558
13              | 122 895 | 62 510  | 130 945                 | 19 115
14              | 245 775 | 123 947 | 262 017                 | 38 228
15              | 491 535 | 251 769 | 524 161                 | 76 456
16              | 983 055 | 490 842 | 1 048 449               | 152 911

Table 7
The number of nodes reduced by the integration and its upper and lower bounds for english.300MB.

Codeword length | Nodes reduced by the integration | Upper bound (Theorem 4) | Lower bound (Theorem 5)
8               | 53 719                           | 56 445                  | 25 868
9               | 106 702                          | 112 780                 | 25 868
10              | 211 361                          | 225 449                 | 25 868
11              | 420 556                          | 450 788                 | 25 868
12              | 840 258                          | 901 466                 | 25 868
13              | 1 682 567                        | 1 802 822               | 25 868
14              | 3 364 878                        | 3 605 533               | 25 868
15              | 6 726 202                        | 7 210 955               | 25 868
16              | 13 442 247                       | 14 421 799              | 25 868

Table 8
The number of nodes reduced by the integration and its upper and lower bounds for dna.

Codeword length | Nodes reduced by the integration | Upper bound (Theorem 4) | Lower bound (Theorem 5)
8               | 1942                             | 3465                    | 161
9               | 4746                             | 6923                    | 161
10              | 7713                             | 13 841                  | 161
11              | 14 884                           | 27 677                  | 161
12              | 30 850                           | 55 348                  | 161
13              | 60 385                           | 110 690                 | 161
14              | 121 828                          | 221 374                 | 161
15              | 239 766                          | 442 742                 | 161
16              | 492 213                          | 885 478                 | 161

We also compared eight compression algorithms: Tunstall, AIVF, VMA, STVF,⁶ v2vdc [3], gzip, bzip2, and LZMA. We used the default options for gzip, bzip2, and LZMA, and set the codeword length of the VF codes to 14. The results are shown in Table 9. It should be noted that v2vdc is a word-based compression method and is therefore not effective for DNA data; we represent this by N/A in Table 9. The compression ratios of VMA, AIVF, and Tunstall are worse than those of v2vdc, gzip, and bzip2 for the English text, because they do not assume a Markov source. On the other hand, VMA and AIVF are rather good for the DNA text. v2vdc takes a long time to compress the English text because it constructs suffix arrays. Although the compression and decompression speeds of VMA are slower than those of bzip2 for the English text, they are faster than those of bzip2 for the DNA text. The reason is that our implementations of VMA, AIVF, and Tunstall take O(|Σ|) time to traverse the parse tree for each character.
Finally, we measured the performance of compressed pattern matching algorithms. We compared VMA, AIVF, Tunstall, gzip, and v2vdc. We used UNIX zgrep for pattern matching on text compressed with gzip. We chose patterns with lengths of 5–25 characters from the text. We measured the pattern matching times for 5 patterns of each length and calculated the average. Tables 10 and 11 list the resulting matching times; the codeword length of the VF codes was set to 12, and the leftmost column indicates the pattern length. We indicate the results for dna with v2vdc as N/A because, as mentioned above, it is not applicable to such data. The experimental results show that pattern matching on VF codes is superior to that on gzip. Although v2vdc has the advantage for natural language texts like english.300MB, it is not directly applicable to texts without word separations, such as a DNA sequence.

⁶ STVF does not work on several hundred megabytes of text in our experimental environment due to lack of memory. We therefore divided the input text into 100 MB pieces, and report the compressed file size and the compression and decompression times as the sums over the pieces.


Table 9
Experimental results compared with variable length encoding methods. The codeword length of the VF codes is fixed to 14.

         | Compression ratios (%)    | Compression times (sec)   | Decompression times (sec)
         | english.300MB | dna       | english.300MB | dna       | english.300MB | dna
VMA      | 58.38         | 25.59     | 60.06         | 30.41     | 20.72         | 15.96
AIVF     | 58.56         | 25.57     | 160.71        | 32.48     | 19.53         | 15.16
Tunstall | 76.76         | 30.09     | 19.03         | 16.83     | 9.70          | 5.51
STVF     | 56.74         | 24.76     | 1175.61       | 672.42    | 7.54          | 4.75
v2vdc    | 27.07         | N/A       | 6266.11       | N/A       | 4.95          | N/A
gzip     | 37.82         | 28.12     | 23.60         | 49.87     | 2.79          | 3.33
bzip2    | 28.03         | 25.76     | 38.10         | 52.51     | 13.00         | 20.98
LZMA     | 24.98         | 22.73     | 345.05        | 730.48    | 5.72          | 7.66

Table 10
Pattern matching times for english.300MB in seconds. The codeword length of the VF codes is 12.

Pattern length | Tunstall | AIVF | VMA  | gzip | v2vdc
5              | 1.91     | 2.07 | 2.11 | 2.78 | 0.51
10             | 1.97     | 2.06 | 2.14 | 2.73 | 0.51
15             | 2.00     | 2.07 | 2.14 | 2.69 | 0.51
20             | 2.03     | 2.08 | 2.15 | 2.66 | 0.51
25             | 2.03     | 2.09 | 2.15 | 2.65 | 0.51

Table 11
Pattern matching times for dna in seconds. The codeword length of the VF codes is 12.

Pattern length | Tunstall | AIVF | VMA  | gzip | v2vdc
5              | 1.03     | 0.98 | 1.01 | 3.53 | N/A
10             | 1.04     | 0.98 | 1.01 | 4.98 | N/A
15             | 1.04     | 0.99 | 1.01 | 4.09 | N/A
20             | 1.05     | 0.99 | 1.02 | 4.31 | N/A
25             | 1.07     | 0.99 | 1.02 | 5.01 | N/A

6. Conclusion
In this paper, we presented an efficient algorithm for AIVF codes, which integrates the multiple parse trees of an AIVF code into one tree, called a VMA tree, and simulates the encoding and decoding processes on it. We also estimated the number of nodes in the VMA tree. We conducted several experiments to evaluate the compression performance of our method. Although decompression using a VMA tree is slower than that of the original AIVF coding, compression is faster than that of AIVF when long codewords are used.
The proposed method only works for memoryless sources, such as DNA, and cannot be extended to natural language texts in which the probability of each character depends on its context (i.e., texts that are modeled well by higher-order Markov models). The reason is that, for higher-order Markov sources, the number of possible parse trees would be exponential in the alphabet size in the worst case.
Although the methods presented in [8,6] have better compression ratios than AIVF codes, they need to construct suffix trees [5] in order to construct a parse tree, and thus they take a long time to do so. Therefore, from the viewpoint of speeding up compressed pattern matching, AIVF code with the VMA tree is a strong candidate as well, because a parse tree usually has to be constructed in order to conduct compressed pattern matching on VF codes.
Acknowledgements
This work was supported by JSPS Research Fellowships for Young Scientists and JSPS KAKENHI Grant Numbers 24240021,
25240003, 24650062.
References
[1] N. Brisaboa, A. Fariña, G. Navarro, M. Esteller, (S, C)-dense coding: an optimized compression code for natural language text databases, in: Proceedings of the 10th International Symposium on String Processing and Information Retrieval (SPIRE 2003), in: Lecture Notes in Computer Science, vol. 2857, Springer, 2003, pp. 122–136.
[2] N. Brisaboa, E. Iglesias, G. Navarro, J. Paramá, An efficient compression code for text databases, in: Proceedings of the 25th European Conference on IR Research (ECIR 2003), in: Lecture Notes in Computer Science, vol. 2633, Springer, 2003, pp. 468–481.
[3] N. Brisaboa, A. Fariña, J.-R. López, G. Navarro, E. López, A new searchable variable-to-variable compressor, in: Proceedings of the Data Compression Conference 2010 (DCC 2010), IEEE, 2010, pp. 199–208.
[4] N. Brisaboa, A. Fariña, G. Navarro, J. Paramá, Dynamic lightweight text compression, ACM Trans. Inf. Syst. 28 (3) (2010) 10:1–10:32.
[5] M. Crochemore, W. Rytter, Jewels of Stringology, World Scientific, 2002.


[6] T. Kida, Suffix tree based VF-coding for compressed pattern matching, in: Proceedings of the Data Compression Conference 2009 (DCC 2009), IEEE, 2009, p. 449.
[7] S. Klein, M. Ben-Nissan, Using Fibonacci compression codes as alternatives to dense codes, in: Proceedings of the Data Compression Conference 2008 (DCC 2008), IEEE, 2008, pp. 472–481.
[8] S. Klein, D. Shapira, Improved variable-to-fixed length codes, in: Proceedings of the 15th International Symposium on String Processing and Information Retrieval (SPIRE 2008), in: Lecture Notes in Computer Science, vol. 5280, Springer, 2008, pp. 39–50.
[9] S. Maruyama, Y. Tanaka, H. Sakamoto, M. Takeda, Context-sensitive grammar transform: compression and pattern matching, in: Proceedings of the 15th International Symposium on String Processing and Information Retrieval (SPIRE 2008), in: Lecture Notes in Computer Science, vol. 5280, Springer, 2008, pp. 27–38.
[10] I. Munro, V. Raman, S. Rao, Space efficient suffix trees, J. Algorithms 39 (2) (2001) 205–222.
[11] S. Savari, Variable-to-fixed length codes for predictable sources, in: Proceedings of the Data Compression Conference 1998 (DCC '98), IEEE, 1998, pp. 481–490.
[12] B. Tunstall, Synthesis of noiseless compression codes, Ph.D. thesis, Georgia Institute of Technology, Atlanta, GA, 1967.
[13] T. Uemura, S. Yoshida, T. Kida, An improvement of STVF code by almost instantaneous encoding, Tech. rep. TCS-TR-A-10-44, Hokkaido University, 2010.
[14] J. Verhoeff, A new data compression technique, Ann. Syst. Res. 6 (1978) 139–148.
[15] H. Yamamoto, H. Yokoo, Average-sense optimality and competitive optimality for almost instantaneous VF codes, IEEE Trans. Inf. Theory 47 (6) (2001) 2174–2184.
[16] S. Yoshida, T. Kida, An efficient algorithm for almost instantaneous VF code using multiplexed parse tree, in: Proceedings of the Data Compression Conference 2010 (DCC 2010), IEEE, 2010, pp. 219–228.
[17] S. Yoshida, T. Kida, Analysis of multiplexed parse trees for almost instantaneous VF codes, in: Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics (IIAI-AAI '12), IEEE Computer Society, Washington, DC, USA, 2012, pp. 36–41.
[18] S. Yoshida, T. Uemura, T. Kida, T. Asai, S. Okamoto, Improving parse trees for efficient variable-to-fixed length codes, J. Inf. Process. 20 (1) (2012) 238–249.
