Improving the performance of regular expressions

BMC Helix Discovery allows users to enter regular expressions (regexes) for a number of purposes, such as:

  • associating TPL rules with events
  • processing information in the body of a TPL pattern
  • filtering out sensitive information from command lines.

These regular expressions are often matched against a large data set, so it is important that they are written to be efficient.

Why is regular expression efficiency important?

While a well written regular expression can be very efficient, a badly written regular expression might take a long time to run and slow the system down significantly. It is quite possible to write a regular expression that will take hours or days to finish - it is even possible to write a regular expression that will not finish within the lifetime of the universe when run against moderately sized strings.

Several improvements have been made in BMC Helix Discovery to make it more robust against inefficient regular expressions than previous versions. It now minimizes the regular expression matching needed when deciding which TPL patterns to run. It also spreads the work of running TPL patterns among multiple processors so that if one is busy with a long regular expression match, the other processors can carry on working.

Despite the improvements, writing efficient regular expressions is still important for keeping BMC Helix Discovery running at its best. If BMC Helix Discovery slows down significantly while scanning a network, and either the reasoning or discovery service is using 100% CPU for long periods, then one possible cause is an inefficient regular expression.

Anatomy of an inefficient regular expression

So how do you write an inefficient regular expression? One problem is excessive backtracking; this can happen when there is more than one repetition operator in the regular expression. A repetition operator is +, *, or {n,m}. If the regular expression makes a partial match but then fails, it must loop back and try any other possible partial matches in case any of them succeed.

For example, consider matching the regular expression a.*b.*cd against the string abc abc abc. The match will never succeed since there is no d in the string, but the regular expression must still check every possible combination of the letters a, b, and c before it gives up. That is (the bracketed characters are those consumed by the a, b, and c of the pattern):

"[abc] abc abc",
"[ab]c ab[c] abc",
"[ab]c abc ab[c]",
"[a]bc a[bc] abc",
"[a]bc a[b]c ab[c]",
"[a]bc abc a[bc]",
"abc [abc] abc",
"abc [ab]c ab[c]",
"abc [a]bc a[bc]",
"abc abc [abc]"

As a rough guide the number of comparisons that the regular expression needs to perform is proportional to the length of the string times the number of possible intermediate matches.

In this example, using the non-greedy operators (that is, a.*?b.*?cd) makes no difference to the number of matches it will make, since the regular expression engine still needs to try every combination.
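
The growth can be seen by timing the failing match on longer and longer inputs. This quick sketch uses Python's re and timeit modules (BMC Helix Discovery uses the Python regular expression library); the timings it prints will vary by machine:

```python
import re
import timeit

# A rough demonstration of the cost of the failing match described
# above: a.*b.*cd against repetitions of "abc ".
pattern = re.compile(r'a.*b.*cd')
lazy = re.compile(r'a.*?b.*?cd')

for repeats in (50, 100, 200):
    text = 'abc ' * repeats   # contains no d, so every search fails
    t_greedy = timeit.timeit(lambda: pattern.search(text), number=3)
    t_lazy = timeit.timeit(lambda: lazy.search(text), number=3)
    print(f'{repeats:4d} repeats: greedy {t_greedy:.4f} s, '
          f'non-greedy {t_lazy:.4f} s')
```

Doubling the input length increases the time much faster than linearly, and the non-greedy variant fares no better.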

Real World examples

Let's take a look at some examples based on real regular expressions that have caused problems in BMC Helix Discovery:

\b.*xx.*foo

This was a regular expression that was compared against the command line of processes found on a host. It was failing when run against a half-megabyte string which included lots of repetitions of xx but did not contain foo.

Let's break down what happens when it is matched against a string:

  1. The engine starts scanning from the start of the string.
  2. The engine scans forward until it finds a word boundary \b.
  3. The engine scans forward from the word boundary until it finds a matching xx.
  4. The engine scans forward from the xx until it finds the string foo or it reaches the end of the string.
  5. If it has reached the end of the string and not matched foo, it loops back to step 3 and scans forward to the next xx.
  6. If it has matched all the xx and still not found foo, it loops back to step 2 and scans forward to the next word boundary.

So the regular expression matching contains nested loops; the total processing time is determined by the length of the string (for the command line that was causing the problem this was approximately 500,000 characters) times the number of xx substrings (approximately 500) times the number of word boundaries (approximately 80,000). This was roughly equivalent to scanning a string twenty trillion characters long, and took more than a day to complete.

This was fixed in two ways:

Firstly, the \b.* was removed from the start of the regular expression, since it served no purpose other than to slow the whole thing down. This change reduced the runtime from days to a few seconds.

Secondly, we can use knowledge about the data we want to match; in this case we are only interested in the text from the start of the string up to the first space. So to stop the regular expression scanning the entire string we can anchor the regular expression to the start of the string with ^ and use the \S token to match non-whitespace characters. The final regular expression ^\S*xx\S*foo will stop as soon as it reaches a whitespace character. It now takes a few microseconds when run against the same string.
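
The behaviour of the fixed expression can be checked with a quick sketch (the paths below are invented examples; xx and foo are the stand-in tokens from the text):

```python
import re

slow = re.compile(r'\b.*xx.*foo')
fast = re.compile(r'^\S*xx\S*foo')

# The anchored version only examines the first whitespace-delimited token:
print(fast.search('/opt/xx/bin/foo --flag'))   # matches
print(fast.search('cmd /opt/xx/bin/foo'))      # None: xx...foo is after a space

# On a long string with many xx but no foo, the anchored version gives
# up as soon as it reaches the first whitespace character:
text = ('xx' * 50 + ' ') * 200
print(fast.search(text))                       # None, almost instantly
```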

-A(\D+)+-B(\D+)

This was used as a sensitive data filter. The intention was to scan command lines and remove options that start with -A and -B. However, not only did it fail to do what the writer intended, it performed in a manner that could potentially take forever to process.

Let's break it down:

  1. Scan from the start of the string until we find -A.
  2. Match all non-digit characters until we find -B.
  3. If -B is not found then try matching every combination of groups between -A and the end of the string or the next digit. For example, if the remainder of the string was abcd then it would match each of the following groups for (\D+)+
    (abcd)
    (abc)(d)
    (ab)(cd)
    (ab)(c)(d)
    (a)(b)(cd)
    (a)(bc)(d)
    (a)(bcd)
    (a)(b)(c)(d).

The number of combinations will double for each additional character in the remaining string.

So for the situation where the command line contains a -A that is not followed by a -B it will take a time proportional to 2^N, where N is the number of characters between the -A and the next digit or end of the string. To put that into perspective, on a standard PC a string of 22 characters takes about one second. A string of 40 characters would take about 3 days, and a string of 100 characters would take 9,583,696,565,945,500 years, though this has not been tested.

This regular expression was fixed by removing the group repetition, since it served no purpose:

-A(\D+)-B(\D+)

The runtime went down from forever to a few microseconds.
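
The corrected filter can be sketched as follows (the -A/-B options and their values here are invented for illustration):

```python
import re

fixed = re.compile(r'-A(\D+)-B(\D+)')

m = fixed.search('run -Asecret -Balso-hidden tail')
print(m.group(1))   # 'secret '
print(m.group(2))   # 'also-hidden tail'

# With the nested repetition gone, a -A that is never followed by -B
# fails after a single linear backtracking pass:
print(fixed.search('-A' + 'x' * 100000))   # None, almost instantly
```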

Guidelines for writing efficient regular expressions

This section provides guidelines to help you avoid common problems and write efficient regular expressions.

Consider the failure scenarios

As the previous examples show, the problems occur when a regular expression fails to match completely, but there are multiple partial matches. When writing a regular expression it is not sufficient to consider what happens when it succeeds; you must also consider how it performs when it fails. The regular expressions used in BMC Helix Discovery are often matched against a large number of command lines, which can be very long (up to a million characters is not unknown) and might contain text that matches parts of your regular expression but not the whole thing.

Beware of multiple repetitions of wildcard tokens

A wildcard token is anything that can match more than one character; this includes: ., \w, \D, character classes, negated character classes and so forth.

If you have a regular expression with two or more consecutive wildcard repetitions then there is the possibility of backtracking. For example, if the target string starts with a, has a length of N characters, and does not contain an x, then:

  • a.*.*x - will take N^2 matches.
  • a.*x*.*x - will also take N^2 matches, since x* can be a zero-length match so can match anywhere in the string.
  • a.*y.*x - will take N*M matches, where M is the number of occurrences of y.
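
The quadratic growth can be observed directly; in this sketch the subject string starts with a, is about N characters long, and contains no x:

```python
import re
import timeit

pattern = re.compile(r'a.*.*x')

for n in (1000, 2000, 4000):
    text = 'a' + '-' * n
    t = timeit.timeit(lambda: pattern.search(text), number=3)
    print(f'N={n:5d}: {t:.4f} s')   # time roughly quadruples as N doubles
```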

REALLY beware of nested group repetitions

As described above, if there is a group that contains a repetition, and the group itself is also repeated, for example (.*)*, then the number of possible matches might increase exponentially.

Do not start a regular expression with wildcard repetitions

This is a special case of the second point above. The regular expression engine searches for a match anywhere in the string, so it tries to match the regular expression starting at the first character, then at the second character, and so on until it gets a match. A regular expression of .*x is equivalent to a regular expression of ^.*?.*x, which suffers from the backtracking problem described above.

Since this is a very common error, BMC Helix Discovery looks for regular expressions that start with an unnecessary .* and strips it out where possible.

Only match what you really need

The regular expression should be designed to stop as soon as it has enough to match, or to know that it cannot match. For example consider the regular expression \b\w+XXX.*

  • the \b\w+ is redundant, and can be replaced by \w. This will match in all the situations where the original regular expression matched.
  • the .* at the end is also redundant, since it will have no effect on whether the match will succeed or fail.

So the entire regular expression can be replaced with \wXXX.

The exception to this is in situations where you need to use the string that has matched, rather than simply test if a match has succeeded or failed.
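
The equivalence can be checked directly; \wXXX succeeds and fails on exactly the same strings as \b\w+XXX.* (with XXX standing in for whatever literal is being sought):

```python
import re

original = re.compile(r'\b\w+XXX.*')
reduced = re.compile(r'\wXXX')

# The two expressions agree on every test string:
for s in ('fooXXXbar', 'a XXXtail', 'noXXX', 'nothing here', 'XXX alone'):
    assert (original.search(s) is None) == (reduced.search(s) is None)
```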

Try to fail fast

Try to make the entire regular expression fail if it reaches a point where it could not possibly match the intended target.

An example of this is the first real world example shown above. The regular expression ^\S*xx\S*foo will never scan a string past the first whitespace character, regardless of whether it succeeds or fails. In the context in which it was being used, this meant the difference between scanning a sequence of perhaps a hundred characters and scanning a sequence of hundreds of thousands of characters.

Profile - especially the failure cases

It is important to test your regular expression to make sure that it matches what you expect it to. However it is also important to test it for performance against long strings that partially match your regular expression, for example a megabyte string of random characters.

Minimize the data you pull back from hosts

TPL patterns let you run commands on a host and retrieve data to be searched with a regular expression, for example for versioning information.

A common mistake is to pull back a huge amount of information, for example:

cat file1 file2 file3...

and then run a regular expression on the returned data to extract one piece of information.

This can potentially return a lot of data, so it will not only take a long time for the regular expression to match, it will also be slow to transfer the data over the network and take up a lot of space in the datastore.

A better way is to only get the information that you are interested in by running commands on the host such as grep or head. For example:

grep VERSION file1 file2 file3...

You can then run the regular expression on the much shorter text that is returned.

Do not use groups unless absolutely necessary

When you enclose part of a regular expression in a group using parentheses, the regular expression engine must do extra work to save the text matched by the group in case it is needed later. This can slow the matching process down, in some cases by a factor of four or more.

If you need to use parentheses, for example for part of a regular expression that is repeated, but you do not need to use the contents of the group afterwards then you can use the non-grouping version of parentheses - (?:...). This is generally still slower than not having parentheses at all, but is faster than normal parentheses. For example:

  • (abc|def) - slow. Only use if you need to use the matched text later.
  • (?:abc|def) - faster
  • abc|def - fastest
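
The difference can be measured with a quick sketch; the exact ratios depend on the Python version in use:

```python
import re
import timeit

subject = 'xyz ' * 1000 + 'def'

variants = [('(abc|def)', re.compile(r'(abc|def)')),
            ('(?:abc|def)', re.compile(r'(?:abc|def)')),
            ('abc|def', re.compile(r'abc|def'))]

for name, pat in variants:
    t = timeit.timeit(lambda: pat.search(subject), number=100)
    print(f'{name:12s} {t:.4f} s')

# All three find the same text; only the first records a group.
```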

Think about the regular expression

While these guidelines might help, there is no substitute for the thought and understanding that comes from taking a step back and re-examining your regular expression for efficiency and accuracy.

Regular expression optimizations for TPL triggers

BMC Helix Discovery uses a regular expression index to improve pattern performance. The index is used to quickly eliminate those regular expressions that will obviously not be matched by a given string. This significantly reduces the number of regular expressions that must be executed. Consequently, reasoning can usually avoid executing the sort of pathological regular expressions that are described in the rest of this document.

When optimizing regular expressions for TPL pattern triggers it helps to have a basic understanding of how the index works.

The index

Regular expressions are indexed by a property called a hint. A hint is a case-insensitive string that is a substring of every string that matches the expression. For example, the expression Hoist the (mizzen mast|main brace), ye (landlubbers|scurvy dogs)! has three obvious hints:

  • "Hoist the " (including the trailing space)
  • ", ye "
  • "!"

For simplicity, each expression is indexed only by its longest hint. In the case of the example above, the hint would be "Hoist the ".

Some regular expressions do not have any hints. For example, \d+[-/]\d+[-/]\d+ has no substring that must be part of the matching strings. The hint calculator is also (currently) fairly naive, and misses some hints that could potentially be extracted. Expressions that have no hints must be treated separately.

When trying to match a string, the index is queried for a set of the regular expressions whose hint appears as a substring of the string being matched. Once built, the index can perform this query very quickly. This set, combined with a set of expressions that have no hints, forms the set of all the regular expressions that can potentially be matched by the string. An attempt is made to match the string against each expression in turn until one is found or there are no expressions left, meaning that the string does not match.
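
The idea can be sketched in a few lines of Python. This is not the actual BMC Helix Discovery implementation, just a minimal model: each expression is stored under its longest literal substring, and a query returns only the expressions whose hint occurs in the subject, plus all hint-less expressions:

```python
import re

def longest_hint(pattern):
    """Deliberately naive hint extraction: blank out escape sequences
    and character classes, split on metacharacters, and keep the
    longest literal run. (A real extractor must also handle alternation
    and text made optional by * or ?, which this one does not.)"""
    cleaned = re.sub(r'\\.|\[[^\]]*\]', '\x00', pattern)
    runs = [r for r in re.split(r'[\x00.()|?*+{}^$]', cleaned) if r]
    return max(runs, key=len).lower() if runs else None

class HintIndex:
    def __init__(self):
        self.hinted = []    # (hint, compiled expression) pairs
        self.unhinted = []  # must be tried against every subject string

    def add(self, pattern):
        hint = longest_hint(pattern)
        compiled = re.compile(pattern, re.IGNORECASE)
        if hint:
            self.hinted.append((hint, compiled))
        else:
            self.unhinted.append(compiled)

    def candidates(self, subject):
        subject = subject.lower()
        return [p for h, p in self.hinted if h in subject] + self.unhinted

index = HintIndex()
index.add(r'Hoist the \w+, ye \w+!')   # hint: "hoist the "
index.add(r'\d+[-/]\d+[-/]\d+')        # no hint: always a candidate

print(len(index.candidates('hoist the mainsail, ye dogs!')))  # 2
print(len(index.candidates('ls -la /tmp')))                   # 1
```

Only the candidate expressions are actually executed against the subject; everything else is eliminated by a cheap substring test.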

Try to give a hint

Regular expressions with no hints must be run against every string given to the index. It is important, therefore, to try to use regular expressions for which hints can be calculated.

The index's hint extraction algorithm cannot cope with the following kinds of sub-expression and will use them to delimit hints:

  • Built-in character classes such as \d, \w, and .
  • Custom character classes such as [ijkxyz]
  • Optional expressions such as a? and a*
  • Groups such as (foo)+
  • Alternation such as foo|bar

Using alternation at the top level of the expression always prevents hints being extracted. Alternation inside a group does not prevent hint extraction from parts of the expression outside the group.

As a rule of thumb, fancy characters terminate hints.

Try to make unique hints

If your regular expression only has a hint like java, it will need to be checked against a lot of strings. Try to engineer your expressions so that their longest hints are fairly uncommon.

Further Reading

This section provides some suggested further reading.

Mastering Regular Expressions by Jeffrey Friedl

This O'Reilly book is the authority on regular expressions. Chapter 4 describes the mechanics of how regular expression engines match text, and Chapter 6 describes writing efficient regular expressions in greater depth than this short document.

The Python regular expression library documentation

The full specification of the regular expression syntax used in BMC Helix Discovery can be found on the Python website.

Useful tools

There are a lot of useful tools for developing and testing regular expressions, and a few that have been found useful are listed below.

BMC Helix Discovery uses the Python regular expression library. Other libraries might have different features, behavior and performance.

Kodos — the Python regular expression debugger

Kodos is a GUI for testing Python regular expressions. It does not give timing information, but is useful for trying regular expressions against test data and seeing what matches. It is open source and runs on Windows, Linux and Mac OS X.

IPython

IPython is an enhanced interactive Python shell. It is useful for timing regular expression matches using the timeit command. For example:

In [1]: import re
In [2]: x = '-' * 1000000 + 'abc'
In [3]: timeit re.search('abc', x)
100 loops, best of 3: 2.28 ms per loop
In [4]:

IPython is open source and runs on Windows, Linux and Mac OS X.
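
If IPython is not available, the standard library's timeit module can take the same measurement from an ordinary Python script:

```python
import re
import timeit

x = '-' * 1000000 + 'abc'
seconds = timeit.timeit(lambda: re.search('abc', x), number=100)
print(f'{seconds / 100 * 1000:.2f} ms per search')
```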

Regex Coach

Regex Coach is a GUI regular expression debugger similar to Kodos. The big advantage of Regex Coach is that you can step through the regular expression as it attempts a match and watch the backtracking in action. This is very useful for getting a deeper understanding of how regular expressions work, and for debugging regular expressions that take much longer than expected.

Regex Coach is written in Lisp but the regular expression library is similar to the Python library. The current version is available for Windows only but earlier versions are available for Linux and Mac OS X. It is free to use, but cannot be distributed commercially without permission.
