Professional Documents
Culture Documents
sidhu@halcyon.usc.edu prasanna@ganges.usc.edu
Figure 1: Syntax tree for ((ajb))(cd). 2.4 NFA Construction for Given Regu-
lar Expression
Below we review the construction of an NFA which matches
and (4) (r1 ) matches the same strings as r1 . Further, is a the same strings as the given regular expression. NFAs for
regular expression that matches the empty string. As an ex- smaller regular expressions can be used to construct NFAs
ample, the regular expression ((ajb))(cd) matches all strings for bigger regular expressions by applying the following rules.
composed of any sequence of zero or more a and b charac- An NFA that matches a single character (c) is constructed as
ters that terminate with the characters cd. Figure 1 shows shown in Figure 2(a). NFAs that match the regular expres-
the syntax tree representation of the expression which is used sions r1 jr2 , r1 r2 , r1 can be constructed as shown in Figure
later. An additional metacharacter is used to denote con- 2 (b), (c) and (d) respectively. N1 and N2 are NFAs for the
catenation in the syntax tree representation. regular expressions r1 and r2 respectively. Each NFA has the
initial state q and only one nal state f . The NFA for (r1 )
is the same as that for r1 . By parsing a regular expression
2.2 NFA and DFA into its constituent subexpressions, and applying the above
rules recursively, an NFA can be constructed that matches
Informally, an NFA is a directed graph in which each node the same strings as the given regular expression. As an ex-
is a state and each edge is labeled with a single character or ample, Figure 3 shows the NFA constructed using above rules
. One state is designated the initial state and some states for ((ajb))(cd).
are accepting or nal states. A DFA is an NFA with no
edges labeled and no state has more than one outgoing
edge labeled with the same character. An FA (NFA or DFA) 3 NFA Construction using FP-
processes an input string and either accepts it or rejects it.
A string is accepted if the string characters match the labels GAs
on any path of the FA that leads from the initial state to an
accepting state. We begin by showing in Section 3.1 how an arbitrary NFA can
be implemented in logic. In Section 3.2 we describe simple
logic structures that implement NFAs (shown in Figure 2) for
2.3 Regular Expression Matching using a single character, r1 jr2 , r1 r2 and r1 . Section 3.3 presents an
algorithm that reads a regular expression and uses the above
FAs logic structures to implement an NFA that matches the same
Given a regular expression, it is possible to construct an NFA strings as the given regular expression. Implementation of
from it which accepts the same strings that are matched by the above algorithm on conventional FPGAs is discussed in
the regular expression. Such an NFA can therefore be used to Section 3.4.
process the input text and perform regular expression match-
ing. On a serial machine, an NFA can be constructed from 3.1 NFA Implementation in Logic
the given regular expression of length n in O(n) time and
the NFA takes O(n) memory. The NFA processes a single There are two standard techniques for implementing DFAs
text character in O(n) time. Thus the total time required to in logic [2]. One way is to use memory that stores the DFA
search through text of length m is O(mn) and the memory states in a (typically binary) encoded form. Since the mem-
required is O(n). An alternate approach for faster search- ory can look up only one next state every clock cycle, it would
ing is to construct a DFA from the regular expression which be diÆcult to extend the above technique to eÆciently im-
takes O(2n ) time and O(2n ) memory. The DFA can process plement an NFA. The other technique is to use the One-Hot
a text character in O(1) time. Thus the above approach re- Encoding (OHE) scheme. One
ip-
op is associated with
quires O(2n + m) time and O(2n ) memory. Text searching each state and at any time exactly one
ip-
op stores a 1,
programs such as grep employ the latter approach but use signifying the current (or active) state. Combinational logic
optimization techniques that in many cases signicantly re- associated with each
ip-
op ensures that the 1-bit from the
duces DFA construction time and memory. However, there active
ip-
op is transferred in the next clock cycle to only
q c f
q1 N1 f1 ε q2 N2 f2
(a)
(c)
ε
q1 N1 f1
ε ε
q f q ε q1 N1 f1 ε f
ε ε
q2 N2 f2
ε
(b) (d)
Figure 2: NFAs for (a) single character, (b) r1 jr2 , (c) r1 r2 , and (d) r1 .
a
ε ε
ε ε ε c ε d
q f
ε
b ε
one
ip-
op (the input to the DFA determines which one).
ip-
op does not have to be implemented and can be elimi-
Below, we extend the OHE technique to enable direct imple- nated. (2) As can be seen from Figure 2, only transitions
mentation of NFAs in logic. from the accept states of the NFAs are used to construct
As mentioned in Section 2.2, one of the two dierences bigger NFAs. Thus based on observation (1), the
ip-
ops
between an NFA and DFA is that a state in an NFA can corresponding to the accept states of the NFAs in Figure 2
have more than one outgoing edge with the same label. In do not have to be implemented and can be eliminated.
implementation, this simply translates into allowing a par- Figure 5(a) shows the logic structure that implements the
ticular
ip-
op output to be connected to the input of more NFA (shown in Figure 2(a)) that matches a single character.
than one
ip-
op. The implementation is trivial and enables The
ip-
op corresponding to the accept state has been elim-
multiple transitions and multiple active
ip-
ops every clock inated based on observation (2). The output is 1 only when
cycle. The other dierences is that an NFA can have edges the
ip-
op stores a 1 and the input character matches the
labeled (empty string) while a DFA cannot. Such an edge character stored in the inside the comparator. Figure 6 shows
transfers the status of its source state to its destination state an eÆcient implementation of the comparator. Figure 5(b)
without waiting for the next character to be processed. In shows the logic that implements the NFA for r1 jr2 (shown
implementation, this can be eÆciently achieved by connect- in Figure 2(b)). Both the initial and nal state
ip-
ops are
ing the input (as opposed to output) of the source
ip-
op eliminated based on observations (1) and (2) respectively.
to the input of the destination
ip-
op. Figure 4(a) shows a The only logic required is an OR gate to combine the out-
simple NFA and Figure 4(b) shows its implementation using puts from N1 and N2 . Figure 5(c) shows the implementation
the techniques described above. of the NFA for r1 r2 (shown in Figure 2(c)). As the latter
gure shows, only three edges are required. Thus the logic
3.2 Logic Structures for NFA Implemen- consists of just three wires. Figure 5(c) shows the logic that
implements the NFA for r1 (shown in Figure 2(d)). As in
tation case of r1 jr2 , the initial and nal state
ip-
ops are elimi-
The following two observations help understand the logic nated. An OR gate is used to combine the two inputs to the
structures described below: (1) As stated above, and tran- initial state of N1 . Another OR gate combines the two inputs
sition translates to a connection from the input of the source to generate the accept output.
ip-
op to the input of the destination
ip-
op. Therefore, if The above logic structures can be combined to construct
all transitions from the source
ip-
op are transitions, the an NFA for a given regular expression. To complete the con-
0,1
1
0,1
b
i
1
0 ε a 0 b c
i 0 0 ε
a c Input
(a) (b)
shows the implementation using logic structures of the NFA LUT 1 LUT 2
shown in Figure 3.
i o
i o
Figure 6: Comparator for ASCII character a (01100001).
Output is one only when input to LUT1 is 0110 and to
Flip
LUT2 is 0001.
flop
i
o1 i1
o i
o2 i2
o
3.3 NFA Construction Algorithm
Text c
character N1 Figure 8 shows the algorithm. It accepts the regular expres-
N2
sion in postx form which is easily obtained by postorder
traversal of the syntax tree of the regular expression. For
(a) (c) example, the postx form of the regular expression in Figure
1 is abj cd . The postx form eliminates the need for the
i o i o
( and ) metacharacters and simplies the algorithm.
The algorithm uses a stack data structure and relies on
the following placement and routing subroutines. place char,
place j, place and place respectively place the logic struc-
tures for a character, and the metacharacters j; ; . They all
o1 i1 o2 i2 o1 i1
return a pointer reference (p) to the placed structure. route1
and route2 create bidirectional connections between the logic
i o i o i o
structures the pointers to which are supplied as arguments|
N1 N2 N1 route1(p, q) connects the o output of q to the i1 input of p
and the o1 output of p to the i input of q. Similarly route2(p,
q) creates connections to i2 and o2 ports of p. The last two
(b) (d)
lines of the algorithm create connections to the i and o ports
of the logic structure for the last symbol in the regular ex-
Figure 5: Logic structures for (a) single character, (b) pression. The o port is connected to the input of a
ip-
op
r1 jr2 , (c) r1 r2 , and (d) r1 . and the i port is connected to a permanently high signal.
1 Match
flip-flop
a b c d
Text
character
Figure 7: Implementation of NFA for ((ajb))(cd). Dashed boxes indicate logic structures.
Assuming that the placement and routing subroutines take bits. Reducing this time is important because the NFA can
constant (O(1)) time, the algorithm executes in O(n) time be constructed only at runtime when the user inputs the reg-
where n is the length of the regular expression. The con- ular expression. Thus the NFA mapping time is a part of the
structed NFA can process one character each clock cycle overall execution time and therefore it should be minimized.
(O(1) time). To be able to process one character in O(1)
time on a serial machine, a DFA has to be constructed for The NFA construction algorithm can be implemented as
the input regular expression. Doing so takes O(2n ) time. In a program that reads a regular expression and outputs the
practice, optimized DFA construction techniques can be quite NFA logic in the form of an HDL description, or a technology
fast in many cases but still require exponential time in the mapped netlist, or a placed and routed netlist, or directly in
worst case. In contrast the proposed approach is very fast the form of conguration bits. In the interests of minimizing
(O(n) time) in all cases. the NFA mapping time, direct generation of conguration
The area of the NFA constructed using the proposed ap- bits would be preferable because then the synthesis, place-
proach would depend upon the placement and routing sub- ment and routing steps can be bypassed. However, doing
routines. In Section 4.2 we show an implementation of these so is not practical for following two reasons. The bitstream
which can construct an NFA in O(n2 ) area. On the other formats of commercial FPGAs are proprietary. Also, even
hand, on a serial machine a DFA takes O(2n ) memory. Again, if the formats were known, any error in generating congu-
optimized techniques can often reduce the exponential mem- ration bits could potentially permanently damage the chip.
ory requirement but not always. In contrast, the proposed Thus, the preferred approach would be to implement the NFA
approach requires only O(n2 ) in all cases. construction algorithm as a program that outputs the NFA
logic as a placed and routed netlist and use vendor tools for
generating conguration bits. For Xilinx FPGAs, another ap-
3.4 FPGA Implementation proach would be to write the program in Java and use Jbits
[9] for generating conguration bits.
We now discuss the implementation of the NFA construction
algorithm for an FPGA architecture. The crucial fact to keep For details on the placement of the logic structures and
in mind is that the NFA mapping time needs to be as small routing between them, please see Section 4.2 which describes
as possible. NFA mapping time is composed of the time re- the algorithm implementation for the SRGA architecture.
quired to construct the NFA, generate conguration bits for Implementation for other architectures can be done along
the NFA logic and congure the FPGA with the generated similar lines.
for(i=0; i<regexp_len; ++i) performance gains and a high degree of
exibility can be ob-
{ tained if the device can generate conguration bits at runtime
switch(regexp[i]) and use them to modify its own conguration|the ability of
{ a device to do so is what we call self-reconguration.
case char: place_char(regexp[i], &p); The SRGA is a multicontext device [7][8] which, as the
push(p); name implies, has been designed to support eÆcient self-
reconguration. The key device features that allow it to do
case |: place_|(&p); so are: (1) Single cycle context switching and (2) Single cy-
pop(&p1); cle random access to the conguration memory. Using these
route1(p, p1); two features self-reconguration can be performed quickly as
pop(&p2); shown by the simple example below.
route2(p, p2); The problem is to congure an AND gate but at compile
push(p); time, we don't know where on the logic cell array the AND
gate is to be congured|the location is known only at run-
case .: place_.(&p); time. Since we cannot congure the AND gate at compile
pop(&p1); time, we congure on one of the contexts logic which will in
route1(p, p1); turn congure the gate (as shown in Figure 9). At runtime,
pop(&p2); the coordinates of the logic cell which is to be congured as
route2(p, p2);
an AND gate are sent to the SRGA. The logic congured
push(p);
on the device reads these coordinates. It then computes the
address of the conguration memory locations corresponding
case *: place_*(&p);
pop(&p1);
to the logic cell|bits written into these locations control the
route1(p, p1);
functionality of the logic cell. The congured logic writes bits
push(p);
that congure the logic cell as an AND gate into the above
} conguration memory locations. Finally, it switches to the
} context on which the gate was congured.
pop(&p); Please see [6] for a detailed description of the SRGA device
route_input_high(p); and its capabilities, including row and column routing using
route_output_ff(p); self-reconguration.
Figure 8: NFA construction algorithm.
Section 5 presents the performance evaluation results of the 4.2 NFA Construction using Self-
above approach for the Xilinx Virtex FPGA architecture. As Reconguration
can be seen from the results, the NFA logic is quite fast, but
the NFA mapping time still forms a signicant portion of the Figure 10 shows the algorithm that reads a regular ex-
overall execution time. pression and constructs the corresponding NFA using self-
reconguration. The algorithm is based on the one shown in
Figure 8 and it operates in a similar manner|the basic dier-
4 NFA Construction using the ence is that the above algorithm explicitly places and routes
the NFA logic. Figure 11 shows the simplied block diagram
SRGA of the datapath and control logic that implements the algo-
rithm. The placement and routing performed by the algo-
We describe in the following sections how the NFA mapping rithm is explained below. For clearer understanding, please
time can be further reduced by using self-reconguration. Us- also see the step by step NFA construction by the algorithm
ing self-reconguration, the NFA construction algorithm can for ((ajb))(cd) shown in Figure 12.
be implemented as congured logic on the device itself. The To perform placement and routing, it is necessary to spec-
logic reads the input regular expression, directly generates ify the logic cell in which a particular logic structure is (or is
the conguration bits for NFA logic and congures the de- to be) congured. The counters row and col, and the regis-
vice with them, all without any external intervention. All ters row1, col1, row2 and col2 are used for this purpose. Note
three components of the NFA mapping time|NFA construc- that the stack pushes or pops a row, column pair in a single
tion, conguration bits generation and device conguration| operation.
are signicantly reduced using self-reconguration. We start The algorithm places the NFA logic as a binary tree, sim-
with a description of the SRGA device. ilar to the syntax tree of the regular expression. At the leaf
nodes are the logic structures for the characters and all of
4.1 Self-Recongurable Gate Array them are placed in the same row (row 0). The column is
(SRGA) specied by the column counter col which is incremented each
time one is placed. Placement in the same row simplies
In most cases, conguration of an FPGA, whether at com- broadcasting the text characters to the comparators.
pile time or at runtime, is performed externally. Much greater Logic structures for the metacharacters form the non-leaf
AND gate
Context 3 Context 3 configuration Context 3
bits
Logic for Context Context Context
configuring Switch Switch AND gate Switch
the AND gate Interface Interface Logic
Context 2 Context 2 Context 2 Active
Gate C C context
coordinates Config. Config. Config.
Memory Memory Memory
Interface Context 1 Interface Context 1 Interface Context 1
Active Active
context context
Memory
regexp[i] i
inc_i
comparator row_inc col_inc
char
row col
|
. Control
logic row2 col2
*
row1 col1
push pop
OR
Stack
Figure 11: Datapath and control logic for NFA construction algorithm.
nodes are each is placed on a separate row. The row is spec- tion time using the SRGA in Section 5.4.
ied by the row counter which is incremented each time one
is placed. Note that the logic structure for one of the child
nodes of j or is always in the preceding row. Thus by placing 5 Performance Evaluation
the logic structure for j or in the same column as the other
child node, connections to the former and latter child nodes In this section, we compare the performance of FPGA imple-
require only row and column routing operations respectively. mentations based on the proposed approach with the perfor-
The logic structure for the only child node of is always in mance of the Unix text search program grep. The comparison
the preceding row. Thus the connection can be established by criteria are the space needed (memory for software and area
simply placing the logic structure for in the same column. for FPGA), and the time required for FA construction and
The numbers at the end of each line in Figure 10 indicate text character processing.
the number of clock cycles required to complete the operation To compare performance, the regular expression we match
on that line. Counter increments occur in parallel with other using both approaches is (ajb) a(ajb)(ajb):::(ajb) which has k
operations and so take zero clock cycles. As described in [6], occurrences of (ajb) at the end (henceforth written as (ajb)
any row or column routing operation on the SRGA takes 4 a(ajb) ). It matches any sequence of a's and b's in which the
k
clock cycles.The row route and col route operations create (k +1)th character from the right is a. Any DFA constructed
bidirectional connections. Also, it takes 2 clock cycles to for the above expression would have at least 2k states. We
switch to the context with the routing logic and back to the vary k to study performance variation with regular expression
current context. Thus each routing operation takes 10 clock length.
cycles. The time for the placement operations is composed of
the time to congure the logic structure in a logic cell (1 bit
per clock cycle) and 2 clock cycles for the context switching. 5.1 Software Performance
The time for the character logic structure also includes the 32
clock cycles required to congure the comparator as shown The software performance evaluation was carried out on a
in Figure 6. machine with an 800 MHz Pentium III Xeon processor with
The above algorithm thus congures the logic structures 2 GB RAM running Linux (Red Hat 6.2). The regular ex-
for char, j, and in 47, 45, 37 and 28 clock cycles respec- pression matching program used was GNU grep version 2.4,
tively. We use the above numbers to estimate NFA construc- which is distributed with the OS. The times reported are user
i=0, regexp[i]=a line=7 i=1, regexp[i]=b line=7 i=2, regexp[i]=| line=12
row=1 row=1 row=1, row1=0, row2=0
3 col=0 3 col=1 3 col=1, col1=1, col2=0
0,1
0,0 0,0
Stack Stack Stack
2 2 2
1 1 1
a a b a b
0 0 0
0 1 0 1 0 1
(a) (b) (c)
2 2 2
1 1 1
a b a b a b
0 0 0
0 1 0 1 0 1
(d) (e) (f)
1
Text
a b c d
0
0 1 2 3
(g)
Figure 12: NFA construction for ((ajb))(cd) (postx form: abj cd ). Each gure shows NFA construction
immediately after the algorithm statement indicated by line has been executed.
1 row=1; col=0; i=0; [1] space CPU times and the memory reported is the maximum
2 while(i<regexp_len) memory used by that invocation of grep.
3 { GNU grep uses clever techniques to reduce time and space
4 switch(regexp[i]) requirements. Essentially, it constructs the DFA for the given
5 { regular expression in a \lazy" manner|only those parts of
6 case char: place_char(regexp[i], col); [46] the DFA are constructed which are required to process the
7 push(0, col); [1] input text characters. For instance, if none of the characters
8 ++col; [0] in the regular expression occurs in the text, the DFA is not
9 constructed at all. Therefore, to force grep to construct the
10 case |: pop(&row1, &col1); [1] entire DFA, the input text must contain patterns that exer-
11 pop(&row2, &col2); [1] cise all possible DFA transitions. For example, for the regular
12 place_|(row, col2); [22] expression (ajb) a(ajb)2 , we create a le containing the four
13 route_row(col2, col1); [10] lines \aaa", \aab", \aba" and \abb"1 .
14 route_col(row, row2); [10]
15 push(row, col2); [1] k Text le CPU Maximum Time per
16 ++row; [0] size (bytes) time memory character
17
19 case .: pop(&row1, &col1); [1] 8 2560 0.00 s 580 KB {
20 pop(&row2, &col2); [1] 9 5632 0.00 s 580 KB {
21 place_.(row, col2); [14] 10 12288 0.00 s 580 KB {
22 route_row(col2, col1); [10] 11 26624 0.00 s 580 KB {
23 route_col(row, row2); [10] 12 57344 0.00 s 580 KB {
24 push(row, col2); [1] 13 122880 0.005 s 580 KB {
25 ++row; [0]
26
14 262144 0.01 s 580 KB {
27 case *: pop(&row2, &col2); [1] 15 557056 0.03 s 580 KB 53.86 ns
28 place_*(row, col2); [26] 16 1179648 0.04 s 580 KB 33.91 ns
29 push(row, col2); [1] 17 2490368 0.08 s 580 KB 32.12 ns
30 ++row; [0] 18 5242880 0.17 s 580 KB 32.42 ns
31 19 11010048 0.34 s 580 KB 30.88
32 }
33 ++i; Table 2: Results for text search and minimal DFA con-
34 } struction (best case).
35 pop(&row, &col); [1]
36 route_input_high(row, col); [3]
37 route_output_ff(row, col); [3]
Such les were created for regular expressions ranging from
k = 8 to k = 19. The program grep was then invoked once for
Figure 10: NFA construction algorithm. each regular expression with the corresponding le supplied
k Text le CPU Maximum as input text. The results, summarized in Table 1, demon-
size (bytes) time memory strate the exponential blowup in both time and memory that
can occur in DFA based regular expression matchers. As k is
8 2560 0.01 s 1 MB increased, time required and memory consumed quickly reach
9 5632 0.05 s 1 MB unacceptable levels. It should be clear from the above discus-
10 12288 0.15 s 1.9 MB sion that times obtained are composed of the worst case DFA
11 26624 0.50 s 2.2 MB construction time (since the complete DFA is constructed)
12 57344 2.22 s 3.0 MB plus the time required to search (k + 2)2k text characters.
13 122880 16.11 s 4.4 MB We now determine time and memory requirements for best
14 262144 82.88 s 7.5 MB case DFA construction (only one transition computed). We
do this by repeating the above experiment with text les of
15 557056 345.33 s 13 MB the same sizes as before, but with each le containing 2k
16 1179648 1383.55 s 26 MB repetitions of a single pattern instead of 2k patterns. Table
17 2490368 5499.60 s 54 MB 2 shows the results obtained. The assumption that the DFA
18 5242880 21900.36 s 111 MB construction takes negligible time is justied by the almost
19 11010048 87309.38 s 229 MB constant time per text character in all cases. Also, all the grep
invocations use the same amount of memory. Therefore, we
Table 1: Results for text search and complete DFA con- 1 In general, for (ajb) a(ajb)k , the number of lines in the text
struction (worst case). le would be 2k . Each line would have k+2 characters (the (k+2)th
character being a newline) resulting in a le size of (k +2)2k bytes.
k DFA Construction k Conguration FPGA NFA
size time bit generation conguration construction
8 420 KB 0.01 s 8 20 ms 1 ms 21 ms
9 420 KB 0.05 s 9 38 ms 1 ms 39 ms
10 1.32 MB 0.15 s 10 31 ms 1 ms 32 ms
11 1.62 MB 0.50 s 11 33 ms 1 ms 34 ms
12 2.42 MB 2.22 s 12 30 ms 1 ms 31 ms
13 3.82 MB 16.11 s 13 28 ms 1 ms 29 ms
14 6.42 MB 82.87 s 14 32 ms 1 ms 33 ms
15 12.42 MB 345.30 s 15 33 ms 1 ms 34 ms
16 25.42 MB 1383.51 s 16 33 ms 1 ms 34 ms
17 53.42 MB 5499.52 s 17 36 ms 1 ms 37 ms
18 110.42 MB 21900.19 s 18 36 ms 1 ms 37 ms
19 228.42 MB 87309.04 s 19 30 ms 1 ms 31 ms
28 38 ms 1 ms 39 ms
Table 3: Software performance results for DFA construc-
tion (worst case). Best case requires negligible time and Table 4: NFA construction times.
memory. Time per text character is 30.88 ns.
7 Acknowledgement
The authors thank Bharani Thiruvengadam for his work on
FPGA performance evaluation.