This is a technical brief on regular expressions, which are at the heart of signature and pattern matching systems such as Anti-Virus, Anti-Spam, Intrusion Prevention Systems, Data Loss Prevention, and Web Application Firewalls.

Vatsal Mehta –

Regular expressions are at the heart of most security solutions. Learn about the different tradeoffs in designing high-performance and scalable regular expression matching systems.

Regular Expression Matching (i.e. the technology by which a given set of expressions is matched against a larger set of data) underpins many of today's security technologies, such as Gateway Anti-Virus, Intrusion Detection and Prevention Systems, Data Loss Prevention systems, Unified Threat Management systems, Cross-Site Scripting protection, URL Filtering, and so on. A constant stream of data is checked for known threats or attacks. While RegEx matching is at the heart of these techniques, more advanced techniques can be layered on top to prevent or detect attacks, such as correlating data from multiple sites and analyzing multiple connections.
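As a minimal illustration of what "matching a set of expressions against a stream of data" means in practice, here is a short Python sketch that scans a chunk of traffic against a couple of hypothetical signature patterns. The rule names and patterns are invented for the example and are not real attack signatures.

    import re

    # Hypothetical signature rules (illustrative only, not real attack signatures).
    signatures = {
        "sql-injection-probe": re.compile(rb"union\s+select", re.IGNORECASE),
        "path-traversal":      re.compile(rb"\.\./\.\./"),
    }

    def scan(chunk: bytes):
        """Return the names of all signatures that match this chunk of traffic."""
        return [name for name, pattern in signatures.items() if pattern.search(chunk)]

    print(scan(b"GET /index.php?id=1 UNION SELECT password FROM users"))
    # ['sql-injection-probe']

A real gateway does the same thing continuously, over every session, against thousands of rules at line rate, which is where the engine design trade-offs discussed below come in.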

From a user's perspective, the biggest complaint about a RegEx matching system is false positives. A false positive is like the story of the boy who cried wolf: if your system generates too many alerts, the accurate alerts decrease in relevance. Another issue is false negatives; the system might miss some attacks. The third issue is signature updates: how long do updates take, how frequently are they required, and what does the system do with active sessions? These are all valid topics, and we will discuss them in this article after covering the basics of RegEx matching. There are primarily two types of RegEx matching technologies: a) Deterministic Finite Automata (DFA) and b) Non-Deterministic Finite Automata (NFA).


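To make the distinction concrete, here is a small sketch that matches the same toy pattern two ways: with Python's built-in re module, whose backtracking engine is commonly described as NFA-style, and with a hand-built table-driven DFA that consumes each input character exactly once. The pattern and input are toy examples chosen only for illustration.

    import re

    PATTERN = r"ab*c"        # toy pattern: 'a', zero or more 'b', then 'c'

    # NFA-style: Python's re module uses a backtracking engine.
    print(bool(re.fullmatch(PATTERN, "abbbc")))   # True

    # DFA-style: an equivalent hand-built transition table.
    # States: 0 = start, 1 = saw 'a' (and any 'b's), 2 = accept, -1 = dead.
    DFA = {
        (0, "a"): 1,
        (1, "b"): 1,
        (1, "c"): 2,
    }

    def dfa_match(text: str) -> bool:
        state = 0
        for ch in text:               # each character is examined exactly once
            state = DFA.get((state, ch), -1)
            if state == -1:
                return False
        return state == 2             # accept only if we end in the accepting state

    print(dfa_match("abbbc"))         # True
    print(dfa_match("abx"))           # False

The broad trade-off: a DFA does constant work per input byte regardless of how many rules are loaded, but its tables can grow very large, while a backtracking engine is compact in memory but can slow down badly on pathological inputs.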
We'll go into the details of these in subsequent articles.

For any kind of RegEx matching, the most important ingredient is the rule set: the regular expression rules that the data is supposed to be matched against. At today's speeds this can be done in software for a single personal computer, but for multiple machines behind a gateway it has to be done in hardware. The signature rule set is then "compiled" into a format that is easily used by the RegEx matching engine; this might remind you of your anti-virus engine downloading signature updates frequently.

These RegEx patterns are specific to the application. For example, a URL Filtering system will be looking for URLs in a stream of data. It may need to understand the different protocols that pass through the system, and as soon as it recognizes a defined protocol, it will start the state-transitioning process. For a system that does specific text matching, the rules may become more complex. Remember that the data is a stream of bits and bytes to the engine; everything is a sequence of 0s and 1s. So let's say we are interested in looking for the word "cooking" in a set of web pages. First we need to understand web pages: they use different encoding schemes, they can be in different languages or in different versions of the same encoding (such as Unicode), and, last but not least, matching may need to be case insensitive. A good RegEx engine takes all of this into consideration when it compiles the rule set, so the rule set does not grow exponentially; it should not be expensive to catch "CooKing" as well as "CooKING". The state transitions would be similar to the following: State 1: c or C; State 2: o or O (if not, go back to State 1); and so on. A sketch of such a state machine is shown below.

Now comes the tough part: signature updates. Let's say you have just discovered a serious new attack and created a signature for it. You need to push it to the field with the least disruption and the maximum protection possible. What are the decision factors? What processing power is needed to compile the new rule set? That helps decide whether compilation takes place in the field or at headquarters. Can the new signature be applied as an incremental update, or does the whole rule set need to be recompiled? We would all prefer an incremental update, but that depends on the technology and the trade-offs made in the regular expression engine (see the second sketch below). How are existing sessions handled, and do they need to be dropped? That probably depends on the severity of the vulnerability. Typically a RegEx matching system is a parallel processing engine, and some engines can be taken offline for updates while the other engines continue to process the traffic.

If you want to play with a RegEx engine, I would suggest the following: Snort and ModSecurity.
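Here is a minimal sketch of the case-insensitive state walk described above for the word "cooking". It is hand-written purely for illustration; a real engine would generate such transitions automatically when compiling the rule set, and the reset on a mismatch is simplified (a production matcher would use proper failure links, as in Aho-Corasick, so no byte is re-examined).

    TARGET = "cooking"

    def find_cooking(data: str) -> bool:
        """Illustrates the State 1: c/C, State 2: o/O, ... transitions."""
        state = 0                                  # number of characters matched so far
        for ch in data:
            if ch.lower() == TARGET[state]:        # one transition per matched character
                state += 1
                if state == len(TARGET):           # reached the accepting state
                    return True
            else:
                # Simplified reset toward State 1; real engines use failure links.
                state = 1 if ch.lower() == TARGET[0] else 0
        return False

    print(find_cooking("I love CooKING at home"))   # True
    print(find_cooking("cookery books"))            # False

Note how case insensitivity is handled by normalizing each input character rather than by adding separate rules for every capitalization, which is exactly what keeps the rule set from growing exponentially.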


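On the update question, the following sketch shows the simplest possible "compile" step: combining all signatures into one object the engine can run. The rule contents are hypothetical. In this naive representation, adding a single new signature forces a full recompile of the combined rule set, which is exactly the incremental-versus-full trade-off discussed above.

    import re

    def compile_rule_set(patterns):
        """Combine all signature patterns into one alternation and compile it once."""
        combined = "|".join(f"(?:{p})" for p in patterns)
        return re.compile(combined, re.IGNORECASE)

    # Hypothetical rule set shipped to the field.
    rules = [r"union\s+select", r"\.\./\.\./", r"<script>"]
    engine = compile_rule_set(rules)

    # A new signature arrives: with this representation there is no incremental
    # path -- the whole rule set has to be recompiled and swapped in.
    rules.append(r"cmd\.exe")
    engine = compile_rule_set(rules)

    print(bool(engine.search("GET /../../etc/passwd")))   # True

Engines that support incremental updates use rule-set representations that allow new states to be grafted in without rebuilding everything, at the cost of extra complexity in the compiler.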

Now let's talk about false positives. False positives are a fact of life. They cannot be eliminated completely, but strategies can be put in place to reduce their impact. For example, a match can be accompanied by the severity of the match, how many times it has happened in the past, and whether other sites are seeing it; this is done by correlating analysis statistics. The match can also be correlated with other characteristics of the attack; for example, a worm would be accompanied by increased traffic to a particular site. Rules around the match can be narrowed: for example, a particular worm may only spread over HTTP or FTP, so all other protocols can be excluded. False negatives are more dangerous; they mean we missed an attack and our security was compromised. Analyzing the causes of false negatives and adopting multiple detection technologies mitigates the risk. Common causes are new or customized attacks that have not been seen before, bugs in software, and misconfiguration.
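As a rough illustration of the correlation ideas above, this sketch suppresses an alert unless the match is severe enough, has been seen repeatedly, and arrives over a protocol on which the signature actually applies. The thresholds, severities, signature names, and protocol lists are invented for the example.

    from collections import Counter

    # Hypothetical per-signature metadata: severity and the protocols it applies to.
    META = {
        "worm-xyz": {"severity": 8, "protocols": {"HTTP", "FTP"}},
    }

    seen = Counter()                      # how often each signature has fired

    def should_alert(signature: str, protocol: str,
                     min_severity: int = 5, min_hits: int = 3) -> bool:
        """Alert only if the match is severe, repeated, and on a valid protocol."""
        meta = META.get(signature)
        if meta is None or protocol not in meta["protocols"]:
            return False                  # e.g. this worm cannot spread over SMTP
        seen[signature] += 1
        return meta["severity"] >= min_severity and seen[signature] >= min_hits

    for _ in range(4):
        print(should_alert("worm-xyz", "HTTP"))
    # False, False, True, True -- only sustained activity produces an alert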

I hope this article is informative. If you have any comments, please feel free to contact me.