
Efficient Regular Expression Evaluation: Theory to Practice

Michela Becchi and Patrick Crowley

ANCS08

Motivation

The size and complexity of rule-sets have increased in recent years


Snort, as of November 2007:
- 8,536 rules, 5,549 Perl Compatible Regular Expressions
- 99% with character ranges ([c1-ck], \s, \w)
- 16.3% with dot-star terms (.*, [^c1..ck]*)
- 44% with counting constraints (.{n,m}, [^c1..ck]{n,m})

Several proposals to accelerate regular expression matching:
- FPGA
- Memory-centric architectures

Michela Becchi 2/27/2008 11/06/2008

Objectives

Can distinct algorithmic techniques be combined into a single proposal that also scales to large data-sets? Can techniques intended for memory-centric architectures also be applied to FPGAs?

Provide a tool that allows anyone to implement a high-throughput DPI (deep packet inspection) system on the architecture of their choice


Target Architectures
Regex-matching engine targets, differing in available parallelism:
- Memory-centric architectures: general-purpose processors, network processors, FPGA/ASIC + memory
- FPGA logic


Challenges
- Memory-centric architectures (general-purpose processors, network processors, FPGA/ASIC + memory), DFA: memory space, memory bandwidth
- FPGA logic, NFA: logic cell utilization, clock frequency


D2FA: default transition compression

Observations:
A DFA state is a set of |Σ| next-state pointers, and there is substantial transition redundancy across states

Idea:
Differential state representation through the use of non-consuming default transitions
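[Figure: example DFA over the alphabet {a, b, c} with states s1-s6, shown before and after replacing redundant transitions with default transitions]

In general: each state stores only the transitions that differ from those of its default target. When a state has no labeled transition on the current character, the automaton follows the DEFAULT PATH (a chain of default transitions that consume no input) until it reaches a state with a labeled transition on that character.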
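D2FA lookup can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' data structure: the names `from_dfa` and `D2FA` are hypothetical, transitions are stored in dictionaries, and the 3-state DFA (which recognizes inputs ending in `ab`) is an illustrative example rather than the one in the figure. `from_dfa` keeps only the transitions that differ from the default target's row; `step` follows default transitions without consuming the character and counts the states it visits, which is the memory-access cost the D2FA algorithms try to bound.

```python
from typing import Dict, Optional, Tuple

Row = Dict[str, int]


def from_dfa(dfa: Dict[int, Row], default: Dict[int, Optional[int]]) -> Dict[int, Row]:
    """Keep only the transitions that differ from the default target's full row."""
    labeled: Dict[int, Row] = {}
    for state, row in dfa.items():
        target = default[state]
        if target is None:                 # root of a default tree: keep every transition
            labeled[state] = dict(row)
        else:
            labeled[state] = {c: t for c, t in row.items() if dfa[target][c] != t}
    return labeled


class D2FA:
    """Hypothetical D2FA layout: labeled transitions plus one default per state."""

    def __init__(self, labeled: Dict[int, Row], default: Dict[int, Optional[int]]):
        self.labeled = labeled
        self.default = default

    def step(self, state: int, ch: str) -> Tuple[int, int]:
        """Consume one character; return (next state, number of states visited)."""
        visited = 1
        while ch not in self.labeled[state]:
            nxt = self.default[state]
            if nxt is None:
                raise KeyError(f"no transition on {ch!r} from the root state")
            state = nxt                    # default transition: ch is not consumed
            visited += 1
        return self.labeled[state][ch], visited


# Illustrative DFA for ".*ab" over {a, b, c}; state 2 is accepting.
DFA = {0: {"a": 1, "b": 0, "c": 0},
       1: {"a": 1, "b": 2, "c": 0},
       2: {"a": 1, "b": 0, "c": 0}}
DEFAULT = {0: None, 1: 0, 2: 0}

if __name__ == "__main__":
    labeled = from_dfa(DFA, DEFAULT)
    print(labeled)  # {0: {'a': 1, 'b': 0, 'c': 0}, 1: {'b': 2}, 2: {}}
    m = D2FA(labeled, DEFAULT)
    state = 0
    for ch in "aab":
        state, visited = m.step(state, ch)
        print(ch, state, visited)  # prints: a 1 1 / a 1 2 / b 2 1
```

In this example 9 transitions shrink to 4 labeled ones; the price is an extra state visit, e.g. consuming `a` from state 1 goes through the default transition to state 0.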

D2FA algorithms

Problem: set default transitions so as to
1. maximize memory compression
2. minimize memory bandwidth overhead

[Kumar et al, SIGCOMM06]
- Bound dpMAX on the maximum default path length
- $O(dpMAX+1)$ memory accesses per input char
- Better compression for higher dpMAX

[Becchi et al, ANCS07]
- Only backward-directed default transitions (skipping k levels)
- Amortized memory bandwidth $O(\frac{k+1}{k}N)$ on N input chars
- Depth-first traversal at DFA creation

[Kumar et al]: memory bandwidth = $O((dpMAX+1)N)$, time complexity = $O(n^2 \log n)$, space complexity = $O(n^2)$
vs.
[Becchi et al]: memory bandwidth = $O(\frac{k+1}{k}N)$, time complexity = $O(n^2)$, space complexity = $O(n)$
(n = number of DFA states)

Compression with k=1 ≈ compression with dpMAX = ∞
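The amortized bound for backward-directed default transitions follows from a short potential argument. The derivation below is a sketch, assuming depth(s) is the length of the shortest path from the start state to s, that the start state has depth 0, and that every default transition skips at least k levels; it shows why the bound reads (k+1)/k · N rather than k + 1/k.

```latex
\begin{align*}
&\text{consuming a character: } \operatorname{depth}(\delta(s,c)) \le \operatorname{depth}(s) + 1 \\
&\text{taking a default transition: } \operatorname{depth}(\operatorname{def}(s)) \le \operatorname{depth}(s) - k \\
&\text{after } N \text{ characters and } D \text{ default transitions: }
  0 \le \operatorname{depth}(s_N) \le N - kD \;\Rightarrow\; D \le \tfrac{N}{k} \\
&\text{states visited} = N + D \le \left(1 + \tfrac{1}{k}\right) N = \tfrac{k+1}{k}\, N
\end{align*}
```

With k = 1 this gives at most 2N state visits for N input characters.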



DFA alphabet reduction


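[Figure: example DFA with transitions on character classes ([a-z], A, [B-Z], [0-9], [^0-9a-zA-Z]) and the equivalent DFA over a reduced alphabet of symbols 0-4, obtained through an alphabet translation table]

Characters that lead to the same next state from every DFA state are merged into a single symbol; the alphabet translation table maps each input character to its symbol.

Effective for: ignore-case regexes, character ranges, never-used characters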
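A minimal Python sketch of alphabet reduction, assuming the simplest criterion: two characters are merged when they lead to the same next state from every DFA state. The function names and the ignore-case DFA for `.*[aA][bB]` are hypothetical illustrations, and the paper's own clustering-based algorithm is not reproduced here. The returned `table` plays the role of the alphabet translation table.

```python
from typing import Dict, List, Tuple


def reduce_alphabet(delta: Dict[int, Dict[str, int]],
                    alphabet: List[str]) -> Tuple[Dict[str, int], Dict[int, Dict[int, int]]]:
    """Merge characters whose transition column is identical in every state.

    Returns (translation table: char -> symbol, transition table over symbols).
    """
    states = sorted(delta)
    column_to_symbol: Dict[Tuple[int, ...], int] = {}
    table: Dict[str, int] = {}
    for ch in alphabet:
        column = tuple(delta[s][ch] for s in states)   # next state of ch from every state
        table[ch] = column_to_symbol.setdefault(column, len(column_to_symbol))
    reduced: Dict[int, Dict[int, int]] = {s: {} for s in states}
    for ch, sym in table.items():
        for s in states:
            reduced[s][sym] = delta[s][ch]              # identical for every ch in the class
    return table, reduced


def ignore_case_ab() -> Tuple[List[str], Dict[int, Dict[str, int]]]:
    """Illustrative DFA for the ignore-case pattern .*[aA][bB] over 256 characters."""
    alphabet = [chr(i) for i in range(256)]
    delta: Dict[int, Dict[str, int]] = {}
    for s in range(3):
        delta[s] = {}
        for ch in alphabet:
            if ch in "aA":
                delta[s][ch] = 1
            elif ch in "bB" and s == 1:
                delta[s][ch] = 2
            else:
                delta[s][ch] = 0
    return alphabet, delta


if __name__ == "__main__":
    alphabet, delta = ignore_case_ab()
    table, reduced = reduce_alphabet(delta, alphabet)
    print(len(set(table.values())))  # 3
```

For this DFA the 256 characters collapse into 3 symbols: `[aA]`, `[bB]` and everything else, which is the ignore-case situation the slide lists.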


Multiple-stride DFAs

Idea [Brodie et al, ISCA 2006]: process several input characters (the stride) at a time

[Figure: a DFA (states 0-8, accepting states 4/1 and 8/2) and the equivalent DFA with stride 2, whose transitions are labeled with character pairs such as ab, bc, da, dd, [a-f]a, [b-f]b]

Observations:
- The mechanism was applied to small DFAs (1-2 regexes)
- No distinct handling of accepting states
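Stride doubling composes two 1-stride steps, δ2(s, c1c2) = δ(δ(s, c1), c2). The naive Python sketch below (hypothetical helper names, using an illustrative `.*ab` DFA over {a, b, c} with state 2 accepting) also shows why accepting states need distinct handling: checking only the state reached after each pair misses a match that ends on the first character of a pair.

```python
from itertools import product
from typing import Dict, Sequence, Set, Tuple

Delta = Dict[int, Dict[str, int]]
Delta2 = Dict[int, Dict[Tuple[str, str], int]]


def stride2(delta: Delta, alphabet: Sequence[str]) -> Delta2:
    """Naive stride doubling: delta2(s, c1c2) = delta(delta(s, c1), c2)."""
    return {s: {(c1, c2): delta[delta[s][c1]][c2]
                for c1, c2 in product(alphabet, repeat=2)}
            for s in delta}


def matches_stride1(delta: Delta, accepting: Set[int], text: str, start: int = 0) -> int:
    state, hits = start, 0
    for ch in text:
        state = delta[state][ch]
        hits += int(state in accepting)
    return hits


def matches_stride2(delta2: Delta2, accepting: Set[int], text: str, start: int = 0) -> int:
    """Assumes len(text) is even; only the state reached after each pair is checked."""
    state, hits = start, 0
    for i in range(0, len(text), 2):
        state = delta2[state][(text[i], text[i + 1])]
        hits += int(state in accepting)
    return hits


# Illustrative DFA for ".*ab" over {a, b, c}; state 2 is accepting.
DFA = {0: {"a": 1, "b": 0, "c": 0},
       1: {"a": 1, "b": 2, "c": 0},
       2: {"a": 1, "b": 0, "c": 0}}
DFA2 = stride2(DFA, "abc")

if __name__ == "__main__":
    for text in ("abcc", "cabc"):
        print(text, matches_stride1(DFA, {2}, text), matches_stride2(DFA2, {2}, text))
```

On `cabc` the 1-stride DFA reports one match while the naive 2-DFA reports none, because `b` is the first character of the pair `bc`.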

Multiple stride + alphabet reduction

Stride s means an alphabet of $\Sigma^s$:
- ASCII alphabet: $|\Sigma^2| = 256^2 = 65{,}536$; $|\Sigma^4| = 256^4 \approx 4.29 \times 10^9$
- The effective alphabet is much smaller: character grouping in the 2-DFA yields labels such as [a-cef]a and [b-f]b

[Figure: an example DFA and the corresponding 2-DFA with grouped character-pair labels]

Alphabet reduction may be necessary to make stride doubling feasible on large DFAs:
DFA → alphabet reduction (TxTable1) → stride doubling + alphabet reduction → 2-DFA (TxTable2,1) → stride doubling + alphabet reduction → 4-DFA (TxTable4,2,1)
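One way to read TxTable2,1 is as a map from pairs of stride-1 symbols to stride-2 symbols. The Python sketch below assumes exactly that (the function name is hypothetical): it doubles the stride of a DFA already reduced to m symbols, so only m² symbol pairs are examined instead of 256² = 65,536 character pairs, and then merges pairs whose columns coincide. Accepting-state handling is deliberately ignored here.

```python
from itertools import product
from typing import Dict, Tuple

Table = Dict[int, Dict[int, int]]


def stride2_reduced(delta: Table, m: int) -> Tuple[Dict[Tuple[int, int], int], Table]:
    """Double the stride of a DFA over symbols {0..m-1}, then reduce the m*m pairs.

    Returns (pair table: (sym1, sym2) -> stride-2 symbol, stride-2 transition table).
    Accepting-state handling is ignored in this sketch.
    """
    states = sorted(delta)
    pair_table: Dict[Tuple[int, int], int] = {}
    column_to_symbol: Dict[Tuple[int, ...], int] = {}
    for x, y in product(range(m), repeat=2):
        column = tuple(delta[delta[s][x]][y] for s in states)   # 2-step target from every state
        pair_table[(x, y)] = column_to_symbol.setdefault(column, len(column_to_symbol))
    delta2: Table = {s: {} for s in states}
    for (x, y), sym in pair_table.items():
        for s in states:
            delta2[s][sym] = delta[delta[s][x]][y]
    return pair_table, delta2


# Ignore-case .*[aA][bB] DFA after 1-stride reduction: 0 = other, 1 = [aA], 2 = [bB].
REDUCED: Table = {0: {0: 0, 1: 1, 2: 0},
                  1: {0: 0, 1: 1, 2: 2},
                  2: {0: 0, 1: 1, 2: 0}}

if __name__ == "__main__":
    pairs, delta2 = stride2_reduced(REDUCED, 3)
    print(len(pairs), len(set(pairs.values())))  # 9 3
```

A real rule-set does not collapse this far (the stride-2 alphabets reported later in these slides range from 53 to 470 symbols), but the gap between m² and 256² is what makes the chained translation tables worthwhile.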


Multiple stride + default transitions

Compression:
- Default transitions eliminate transition redundancy
- In multiple-stride DFAs, the number of states does not change substantially, while the number of transitions per state grows exponentially with the stride ($|\Sigma|^{stride}$). The fraction of distinct over total transitions therefore decreases, which increases the potential for compression!

Accepting state handling:

[Figure: a DFA and its 2-DFA, in which states reached when the first character of a pair completes a match are duplicated as new accepting states]

- Duplicated states have the same outgoing transitions as the original states, but a different depth
- A default transition to the original state therefore removes all outgoing transitions from the new accepting states
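One possible reading of the duplication scheme, sketched in Python with hypothetical names: a stride-2 state is a pair (t, r), where r is the rule matched by the first character of the pair (or None). Each duplicate (t, r) copies the row of (t, None), so a default transition to (t, None) leaves it with no labeled transitions at all. This is an illustration under those assumptions, not the paper's construction.

```python
from itertools import product
from typing import Dict, Optional, Sequence, Tuple

Node = Tuple[int, Optional[int]]   # (1-stride state, rule reported by the first char of the pair)
Row2 = Dict[Tuple[str, str], Node]


def stride2_with_accepting(delta: Dict[int, Dict[str, int]], accept: Dict[int, int],
                           alphabet: Sequence[str]) -> Dict[Node, Row2]:
    """Stride doubling in which matches on the first character of a pair are kept.

    If c1 leads to an accepting state q (rule r), the pair leads to a duplicate
    (t, r) of the final state t: it reports r and copies t's outgoing transitions.
    """
    def target(s: int, c1: str, c2: str) -> Node:
        q = delta[s][c1]
        return (delta[q][c2], accept.get(q))     # accept.get(q) is None if q does not accept

    delta2: Dict[Node, Row2] = {}
    worklist = [(s, None) for s in delta]        # every original state is kept
    while worklist:
        node = worklist.pop()
        if node in delta2:
            continue
        base = node[0]                           # duplicates behave like their base state
        delta2[node] = {(c1, c2): target(base, c1, c2)
                        for c1, c2 in product(alphabet, repeat=2)}
        worklist.extend(delta2[node].values())
    return delta2


def compress_duplicates(delta2: Dict[Node, Row2]):
    """Default transition from each duplicate (t, r) to (t, None)."""
    labeled, default = {}, {}
    for node, row in delta2.items():
        t, r = node
        if r is None:
            labeled[node], default[node] = dict(row), None
        else:
            base_row = delta2[(t, None)]
            default[node] = (t, None)
            labeled[node] = {p: n for p, n in row.items() if base_row[p] != n}
    return labeled, default
```

With the `.*ab` DFA used earlier (rule 1 accepted in state 2), reading `cabc` as the pairs `ca`, `bc` ends in the duplicate (0, 1), which still reports rule 1 and stores no labeled transitions.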

Multiple stride + default transitions (contd)

Problem:
- For a large alphabet Σ and stride, the uncompressed DFA may be infeasible
- Out of memory when generating a 2K-node, stride-4 DFA on a Linux machine with 4 GB of memory

Solution:
- Perform default transition compression during DFA creation
- Use the [Becchi et al, ANCS07] compression algorithm: in the situation above, only 10% of the memory is used

DFA → alphabet reduction (TxTable1) → stride doubling + compression, alphabet reduction → compressed 2-DFA (TxTable2,1) → stride doubling + compression, alphabet reduction → compressed 4-DFA (TxTable4,2,1)


Putting everything together


Input: DFAs built from 1-22 regexes, with 48-1,940 states

- DFA → alphabet reduction: |Σ| = 25-44
  → default transition compression: 96.3-98.5% of transitions removed
  → compressed DFA: avg 1-2 labeled transitions/state
- DFA → stride-2 transformation → stride-2 DFA → alphabet reduction: |Σ| = 53-470
  → default transition compression: 97.9-99.5% of transitions removed
  → compressed stride-2 DFA: avg 3-5 labeled transitions/state, same memory bandwidth requirement, initial size = 40X-80X the final size



NFA
Rule-set:
1. ab+cd
2. ab+ce
3. ab+c.*f
4. b[d-f]a
5. bdc

[Figure: NFA for the rule-set above (states 0-18; accepting states labeled state/rule: 4/1, 8/2, 12/3, 15/4, 18/5), and a second, smaller NFA for the same rules (accepting states 4/1, 5/2, 7/3, 10/4, 12/5)]

Multiple stride + alphabet reduction

Stride doubling:
- Avoid creating new states
- Keep multiple transitions on the same symbol separate

[Figure: an NFA fragment (states 4/1, 5, 6/2) and the corresponding 2-NFA, whose transitions are labeled with character pairs such as ab, b., .c, cd, ce]

Alphabet reduction: a clustering-based algorithm as for DFAs, except that sets of target states are compared
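For NFAs, a transition column holds a set of target states rather than a single state. The Python sketch below (hypothetical names; an exact grouping rather than the clustering algorithm mentioned on the slide) merges characters only when every state sends them to the same set of targets. It uses an illustrative NFA for rules 4 (b[d-f]a) and 5 (bdc), with a self-looping start state as an assumption for unanchored matching.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Nfa = Dict[int, Dict[str, Set[int]]]


def reduce_nfa_alphabet(nfa: Nfa, states: List[int], alphabet: str) -> Dict[str, int]:
    """Merge characters that lead, from every state, to the same set of targets."""
    column_to_symbol: Dict[Tuple[FrozenSet[int], ...], int] = {}
    table: Dict[str, int] = {}
    for ch in alphabet:
        column = tuple(frozenset(nfa.get(s, {}).get(ch, set())) for s in states)
        table[ch] = column_to_symbol.setdefault(column, len(column_to_symbol))
    return table


def example_nfa(alphabet: str) -> Nfa:
    """Illustrative NFA for rules 4 (b[d-f]a) and 5 (bdc); state 0 loops on
    every character so that matches can start anywhere (an assumption)."""
    nfa: Nfa = {0: {ch: {0} for ch in alphabet}}
    nfa[0]["b"] = {0, 1, 4}
    nfa[1] = {"d": {2}, "e": {2}, "f": {2}}
    nfa[2] = {"a": {3}}        # state 3 accepts rule 4
    nfa[4] = {"d": {5}}
    nfa[5] = {"c": {6}}        # state 6 accepts rule 5
    return nfa


if __name__ == "__main__":
    alphabet = "abcdefg"
    print(reduce_nfa_alphabet(example_nfa(alphabet), list(range(7)), alphabet))
```

Here `e` and `f` share a symbol, while `d` stays separate because of rule 5 (`bdc`), so 7 characters map to 6 symbols.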


FPGA implementation
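[Figure: block diagram in which INIT, INPUT and CLK feed an alphabet translation stage (Alphabet Tx), a Decoder and the NFA, which produces r MATCH outputs; signal widths k·log|Σ|, log|Σ| and |Σ| are annotated]

- Alphabet translation and decoding: Quine-McCluskey-like minimization scheme
- NFA: one-hot encoding [Sidhu & Prasanna] + logic reduction schemes

[Figure: one-hot encoded NFA fragments (states S1, S2, S3 with transitions on c1, ci, ck, cm, cn, and on Σ - {ci, ck}) before and after logic reduction]

Example with a stride-2 input (c1, c2) and Σ - {bBcCdD} = {aA}: (c1=b OR c1=B) AND NOT (c2=a OR c2=A)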
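The one-hot encoding assigns one flip-flop per NFA state, and each flip-flop's next value is an OR of (source flip-flop AND decoded character class) terms. A Python model of one clock cycle, with hypothetical names and an illustrative NFA for rule 4 (b[d-f]a), is sketched below; keeping the start state always active is an assumption for unanchored matching, and no logic reduction is applied.

```python
from typing import FrozenSet, List, Tuple

Edge = Tuple[int, int, FrozenSet[str]]   # (source state, target state, accepted characters)


def one_hot_step(active: int, edges: List[Edge], ch: str, start: int = 0) -> int:
    """One clock cycle: bit i of `active` models the flip-flop of NFA state i.

    FF[dst] <= OR over edges of (FF[src] AND decoder output 'ch in chars').
    The start state is kept active so that matches may begin at any position.
    """
    nxt = 1 << start
    for src, dst, chars in edges:
        if (active >> src) & 1 and ch in chars:
            nxt |= 1 << dst
    return nxt


def next_state_equations(edges: List[Edge], n_states: int, start: int = 0) -> List[str]:
    """Sum-of-products text form of the same next-state logic."""
    terms: List[List[str]] = [[] for _ in range(n_states)]
    for src, dst, chars in edges:
        cond = " OR ".join(f"c={c}" for c in sorted(chars))
        terms[dst].append(f"(S{src} AND ({cond}))")
    return [f"S{j}' = " + ("1" if j == start else (" OR ".join(t) or "0"))
            for j, t in enumerate(terms)]


# Illustrative NFA for rule 4, b[d-f]a: 0 -b-> 1 -[d-f]-> 2 -a-> 3 (accepting).
EDGES: List[Edge] = [(0, 1, frozenset("b")), (1, 2, frozenset("def")), (2, 3, frozenset("a"))]

if __name__ == "__main__":
    for eq in next_state_equations(EDGES, 4):
        print(eq)
    active = 1
    for i, ch in enumerate("xbeay"):
        active = one_hot_step(active, EDGES, ch)
        print(i, ch, "MATCH" if (active >> 3) & 1 else "")
```

Each printed equation corresponds to the input of one flip-flop, e.g. `S2' = (S1 AND (c=d OR c=e OR c=f))`.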


FPGA Results - throughput


[Figure: bar chart of throughput (Gbps, axis 0-8) per rule-set (any_99, mail_79, http_406) for three configurations: stride 1 with full alphabet, stride 1 with reduced alphabet, stride 2 with reduced alphabet]


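FPGA Results - logic utilization

[Figure: bar chart of the number of slices (axis 0-4,000) per rule-set (any_99, mail_79, http_406) for stride 1 with full alphabet, stride 1 with reduced alphabet, and stride 2 with reduced alphabet. Annotations give the number of states (#s) and the alphabet sizes with stride 1 (|Σ1|) and stride 2 (|Σ2|): #s=7,864, |Σ1|=64, |Σ2|=2,206; #s=2,086, |Σ1|=78, |Σ2|=1,969; #s=2,147, |Σ1|=68, |Σ2|=1,640]

Utilization: 8-46% of an XC5VLX50 device (7,400 slices); for comparison, the XC5VLX330 device has 51,840 slices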


ASIC projected results


Regex partitioning into multiple DFAs; content addressing with 64-bit words:
- 98% of states compressed with stride 1
- 82% of states compressed with stride 2

Stride = 1 (memory footprint in the last two columns)
Rule-set      |Σ|   #states   Compressed states   Full states
k-NFA any      78     2,086   -                   -
k-DFA any1     59    23,846   505 KB              200 KB
k-DFA any2     45    86,977   2.9 MB              55 KB
k-DFA any3     60    14,084   299 MB              48 KB

Stride = 2 (memory footprint in the last two columns)
Rule-set      |Σ|   #states   Compressed states   Full states
k-NFA any   1,969     2,091   -                   -
k-DFA any1    850    28,223   356 KB              32 MB
k-DFA any2    579   102,940   1.27 MB             81 MB
k-DFA any3    627    19,344   244 KB              16 MB

Throughput (SRAM at 500 MHz): 2-4 Gbps with stride 1; 4-8 Gbps with stride 2

Alternative representation: decoders in ASIC or instruction memory
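The projected throughput ranges can be reproduced with a simple model, which is an assumption rather than the authors' stated calculation: one SRAM access per 500 MHz clock cycle, 8-bit characters, and between 1 and 2 state accesses per transition (2 being the (k+1)/k bound for k = 1). The function name is hypothetical.

```python
def throughput_gbps(clock_mhz: float, stride: int, accesses_per_transition: float,
                    bits_per_char: int = 8) -> float:
    """Throughput of a memory-bound automaton under a simple model (assumption):
    one state access per clock cycle, `stride` characters consumed per transition."""
    transitions_per_second = clock_mhz * 1e6 / accesses_per_transition
    return transitions_per_second * stride * bits_per_char / 1e9


if __name__ == "__main__":
    for stride in (1, 2):
        worst = throughput_gbps(500, stride, accesses_per_transition=2)  # k = 1 bound
        best = throughput_gbps(500, stride, accesses_per_transition=1)   # no default hops
        print(f"stride {stride}: {worst:.0f}-{best:.0f} Gbps")
        # prints: stride 1: 2-4 Gbps / stride 2: 4-8 Gbps
```

Under this model doubling the stride doubles throughput because the number of memory accesses per transition stays the same while each transition consumes two characters.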


Conclusion

Algorithm:
- Combination of default transition compression, alphabet reduction and stride multiplying on potentially large DFAs
- Extension of alphabet reduction and stride multiplying to NFAs

FPGA Implementation:
- Use of one-hot encoding with incremental improvement schemes
- Logic minimization scheme for alphabet reduction and decoding

Additional aspects:
- Multiple flow handling: FPGA vs. memory-centric architectures
- Design improvements tailored to specific architectures and data-sets: clustering into smaller NFAs and DFAs to allow smaller alphabets with larger strides


Thank you!

Questions?

http://regex.wustl.edu
