Thursday, June 20, 2019

Zeke on Zeek: Paraglob

Paraglob is a data structure for quick string matching against a large set of patterns. It was originally designed by Robin Sommer, but an early, experimental implementation was slowed significantly by an internal set data structure that ran in linear time for most of its operations. As a result of a couple of these linear time operations being called together, building a paraglob took O(N2) and other operations took O(Nlog(N) time where N is the number of patterns in data structure. In this Zeke on Zeek post I’ll walk through moving paraglob to C++, and using different data structures to reduce its compile time to linear time and other operations to log(N) time.

But first, a cool looking graph summarizing some benchmarks I ran and a look ahead at the performance characteristics of a paraglob. “Queries” refers to how many strings are being matched and “patterns” refers to the number of patterns those queries are being matched against. I chose to have about 20% of the patterns match in this case. The small spikes aren’t consistent across runs, and are likely just my computer doing something else in the background. Notice how small the time increase is from running 1,000 to 20,000 queries. At the upper right paraglob is compiling a set of 10,000 patterns and running 20,000 queries on them in under 2 seconds.



The Algorithm


At its core paraglob is actually built around a relatively straightforward algorithm. For any pattern, there exists a set of words that an input must contain in order to have any hope of matching against it. For example, consider the pattern “do*”. Anything that matches against “do*” must at the very least contain the substring “do”. This can be easily extended to more complicated patterns by just breaking up the pattern on special glob syntax. For example, “dog*fish*cat” contains the substrings [“dog”, “fish”, “cat”]. We call these substrings “meta words”.

We can then reframe our problem as finding any of the meta words inside an input string and checking the patterns associated with those meta words against it. The Aho-Corasick string-searching algorithm coupled with a map from meta words to patterns solves our problem. We can summarize how paraglob works as follows:

CONSTRUCTION:
    for every input pattern:
         extract the meta words
         map them to their respective patterns
         store the meta words in the Aho-Corasick data structure

QUERYING:
    for every input string:
         get all the meta words it contains with the Aho-Corasick structure
         get candidate patterns with the map
         check those patterns against the string
         return the matches

Implementation


With the algorithm designed, paraglob’s actual implementation is fairly straightforward, but with a couple important nuances. The first lies in the fact that for a given set of patterns there is a non-zero chance that one meta word will be associated with multiple patterns. Consider for example a small pattern set [*mischiev[!o]us*, *mischevous*, *.us*, *.gov*] which might flag mischievous typosquatting and government related urls. Already this pattern set has one meta word (us) mapping to two quite different patterns.

As a result, paraglob can’t use a standard map structure which only allows for a single value for every key. The obvious solution to this is to use some sort of multimap, but in practice this proved to be unacceptably slow. Using a multimap slowed down paraglob by as much as a factor of 10 as opposed to an implementation with a standard map structure that ignores the above issue.

In order to achieve the performance offered by the latter, and still handle the association of multiple patterns with one meta word, paraglob uses a custom “ParaglobNode” class that can store a list of patterns and that is then associated with a meta word in a map. paraglobNodes also contain functionality to quickly merge patterns that they contain matching a string with an input vector. This greatly increases the speed at which paraglob is able to find patterns for an input string.

The second important nuance lies in how paraglob handles duplicate patterns. Using the same example pattern set as above, a query for “mischievious-url.uk” contains the meta words us, and mischiev. Mapping those to their respective pattern words, we get [*mischiev[!o]us*, *.us*] from us and [*mischiev[!o]us*] from mischiev. Initially it seems like we should keep these in a set so as to prevent checking the same pattern twice. As it turns out though, maintaining a set internally is much more expensive that just checking duplicate patterns and using vectors internally. The result of this is that a paraglob doesn’t remove any duplicates until the last step when the vector of matching patterns is at its smallest.

Inside Zeek


Paraglob is integrated with Zeek & provides a simple API inside of its scripting language. In Zeek, paraglob is implemented as an OpaqueType and its syntax closely follows other similar constructs inside Zeek. A paraglob can only be instantiated once from a vector of patterns and then only supports get operations which return a vector of all patterns matching an input string. The syntax is as follows:

local v = vector("*", "d?g", "*og", "d?", "d[!wl]g");
local p = paraglob_init(v);
print paraglob_get(p1, "dog");

Out:

[*, *og, d?g, d[!wl]g]

Paraglob also supports serialization, copy, and unserialization operations inside Zeek. This means that a paraglob can be sent to separate processes using Broker. Keep in mind though that copying a paraglob requires that it be recompiled and for very large paraglobs this can be an expensive operation.

While the absence of an add operation might seem strange, it stems from constraints that emerge in paraglob’s implementation. Adding a pattern to a paraglob that is already compiled requires that the paraglob be re-compiled because the Aho-Corasick tree has to be rebuilt. As a result, adding a pattern to a compiled paraglob takes the same amount of time as building a new paraglob from a vector of patterns.

While it seems reasonable that paraglob support both add and compile operations to get around this, I thought this was more likely to confuse than to provide much real benefit. People using paraglob without knowing about its performance characteristics might attempt to add to the paraglob in a loop or forget to compile it resulting in unexpectedly slow performance or errors.

With that said though, I certainly see an argument for extending the paraglob API to include add and compile operations. For use cases where there is an updating pattern set it would remove the need to keep track of a vector of patterns and a paraglob because the paraglob would maintain the vector of patterns itself. Under the hood paraglob already supports add and compile operations so adding those to Zeek would be as simple as extending ParaglobVal slightly and adding two functions to bro.bif.


Next Steps


A paraglob’s state is defined completely by the patterns inside of it. Paraglobs hold no internal state between calls, nor do they make any updates to their internal Aho-Corasick structure unless a new pattern is added. Presently, their serialization function takes advantage of this and only serializes the vector of patterns contained inside a paraglob. For unserializing, a new paraglob is built from that serialized vector of patterns, and its Aho-Corasick structure is recompiled. This recompilation is expensive though, and can take as long as 10 seconds for very long pattern sets.

Ideally, a paraglob could be serialized in such a way that the recompilation step is not needed. There exists some serialization code inside of the Boost C++ Libraries that might be useful in doing this, but due to how complicated the Aho-Corasick trie becomes when it contains a fair amount of patterns, serializing this would likely take a significant effort. Working out a clean way to serialize a Paraglob properly though would potentially result in a serious increase in its usefulness for distributing frequently changing pattern sets


Finally…


A huge thank you to Kamiar Kanani for his excellent Multifast Aho-Corasick implementation, which he allowed us to use under the BSD license for the Zeek project. Without such a well done string searching algorithm underpinning everything this would have been a much more difficult data structure to implement.


Contributed by: Zeke Medley - Website

Helpful Links and information:

Getting Involved: If you would like to be part of the Open Source Zeek Community and contribute to the success of the project please sign up for our mailing lists, join our IRC Channel, come to our events, follow the blog and/or Twitter feed. If you’re writing scripts or plugins for Zeek we would love to hear from you! Can’t figure out what your next step should be, just reach out. Together we can find a place for you to actively contribute and be a part of this growing community.

About Zeek (formerly Bro): Zeek is a powerful network analysis framework that is much different from the typical IDS you may know. https://www.zeek.org/

No comments:

Post a Comment