Thursday, June 20, 2019

Open Source Zeek - Strategic Community Goals

“Coming together is a beginning, staying together is progress, and working together is success.” 
~ Henry Ford

To all members of the Zeek community: today I’m excited to share the strategic goals I’ll be pursuing over the next year. As a reminder, I joined Corelight as Director of Community for the Zeek project a few months ago. I developed the following list after learning about the community and evaluating where it is, talking to many of you, and gathering feedback from the Zeek Leadership Team and the Corelight Founders.

Please understand, this is only a beginning. I’ll be working on other goals in the future, and would like to get your input on what you most need. But based on my prior experience supporting community efforts in the Ubuntu and Open Compute Projects, it’s often helpful to get started with infrastructure, awareness, engagement, and governance. As we work on these items, I am sure other actionable goals will move onto my plate.

If you or your organization would like to help with any of these goals, or if you have questions, comments, or feedback of any kind, please feel free to reach out and let me know.

I look forward to collaborating with you all. Here’s to stronger communities, safer networks and many successes as we work together!

Community Goals


1. Increase Zeek Awareness - We need to drive greater awareness of Zeek in the cybersecurity / threat hunting / detection ecosystems, while also targeting adjacent open source technologies. To this end, we will:
  • Deliver a monthly newsletter (including Zeek news/tutorials, other security news, notable CVEs, etc.)
  • Produce an editorial calendar for 2019, to include:
    • Monthly content cadence (tutorials and articles)
    • Information about new releases (including notes/demos)
    • Document editorial process (for soliciting external contributions)
    • Rewards and incentives (for contributors)

2. Increase Engagement with the Zeek Community - We need more online and in-person engagement opportunities for the Zeek community, because there are many ways to contribute and get involved. To accomplish this, we will seek to have the following:
  • A predictable cadence of in-person meetups, training opportunities, and other events to meet and engage with the community.
  • Engagements and partnerships with adjacent technology communities.
  • Updated / reorganized documentation, tutorials as well as support channels.
  • A calendar of events for the community.
  • Definition for each major type of contribution (what tasks, what skills, what is success and how to reward and retain contributors).

3. Update Zeek Infrastructure - Last year the project was renamed Zeek (formerly Bro). Once a new logo is finalized, we need to rebrand, update, and reorganize the website - with the aim of creating a clean, easy to navigate and intuitive home where Zeek users and developers of all skill levels can go to gain knowledge and know-how. This will help us:
  • Increase brand credibility (making the website convey the same high quality of Zeek project code)
  • Gain community contributors and users (participation is a cornerstone to all successful communities)
  • Encourage contributions and project innovation

4. Design Governance Structure - While we already have a Zeek Leadership Team (LT) and core committers, we don’t have a system that defines how people can move into either of those roles. This work will be broken down into two phases.
  • Phase 1
    • Shed more light onto the decision making process and publish notes after each LT and Zeek community meeting
    • Solicit input from the Zeek community.
  • Phase 2
    • Define processes for how to become part of the leadership and decision making bodies.

Again, thank you so much for being part of the Zeek Community!! 


Helpful Links and information:

Getting Involved: If you would like to be part of the Open Source Zeek Community and contribute to the success of the project please sign up for our mailing lists, join our IRC Channel, come to our events, follow the blog and/or Twitter feed. If you’re writing scripts or plugins for Zeek we would love to hear from you! Can’t figure out what your next step should be, just reach out. Together we can find a place for you to actively contribute and be a part of this growing community.

About Zeek (formerly Bro): Zeek is a powerful network analysis framework that is much different from the typical IDS you may know. https://www.zeek.org/

Zeke on Zeek: Paraglob

Paraglob is a data structure for quick string matching against a large set of patterns. It was originally designed by Robin Sommer, but an early, experimental implementation was slowed significantly by an internal set data structure that ran in linear time for most of its operations. Because a couple of these linear-time operations were called together, building a paraglob took O(N^2) time and other operations took O(N log N) time, where N is the number of patterns in the data structure. In this Zeke on Zeek post I’ll walk through moving paraglob to C++ and using different data structures to reduce its compile time to linear time and other operations to O(log N) time.

But first, a cool looking graph summarizing some benchmarks I ran and a look ahead at the performance characteristics of a paraglob. “Queries” refers to how many strings are being matched and “patterns” refers to the number of patterns those queries are being matched against. I chose to have about 20% of the patterns match in this case. The small spikes aren’t consistent across runs, and are likely just my computer doing something else in the background. Notice how small the time increase is from running 1,000 to 20,000 queries. At the upper right paraglob is compiling a set of 10,000 patterns and running 20,000 queries on them in under 2 seconds.



The Algorithm


At its core paraglob is actually built around a relatively straightforward algorithm. For any pattern, there exists a set of words that an input must contain in order to have any hope of matching against it. For example, consider the pattern “do*”. Anything that matches against “do*” must at the very least contain the substring “do”. This can be easily extended to more complicated patterns by just breaking up the pattern on special glob syntax. For example, “dog*fish*cat” contains the substrings [“dog”, “fish”, “cat”]. We call these substrings “meta words”.
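As a rough illustration of the meta word idea, here is a minimal Python sketch that splits a pattern on glob syntax. The function name meta_words and the regular expression are assumptions made for this example, not paraglob's actual C++ code, and the bracket handling is deliberately simplified.

```python
import re

# Glob syntax that separates meta words: '*', '?', and bracket
# expressions such as '[!o]' or '[a-z]' (simplified handling).
_GLOB_SYNTAX = re.compile(r"\*|\?|\[[^\]]*\]")


def meta_words(pattern):
    """Return the literal substrings any match of `pattern` must contain."""
    return [word for word in _GLOB_SYNTAX.split(pattern) if word]


print(meta_words("do*"))           # ['do']
print(meta_words("dog*fish*cat"))  # ['dog', 'fish', 'cat']
```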

We can then reframe our problem as finding any of the meta words inside an input string and checking the patterns associated with those meta words against it. The Aho-Corasick string-searching algorithm coupled with a map from meta words to patterns solves our problem. We can summarize how paraglob works as follows:

CONSTRUCTION:
    for every input pattern:
         extract the meta words
         map them to their respective patterns
         store the meta words in the Aho-Corasick data structure

QUERYING:
    for every input string:
         get all the meta words it contains with the Aho-Corasick structure
         get candidate patterns with the map
         check those patterns against the string
         return the matches
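
The two steps above can be sketched in a few lines of Python. This is a toy under stated assumptions, not the real implementation: a plain substring scan stands in for the Aho-Corasick structure, Python's fnmatch module stands in for paraglob's glob matcher, and the class name ToyParaglob is made up for the example. Treating patterns with no meta words (such as "*") as candidates for every query is also an assumption.

```python
import re
from fnmatch import fnmatchcase

# Glob syntax that separates meta words: '*', '?', and bracket expressions.
_GLOB_SYNTAX = re.compile(r"\*|\?|\[[^\]]*\]")


class ToyParaglob:
    def __init__(self, patterns):
        # CONSTRUCTION: map each meta word to the patterns it came from.
        self.meta_to_patterns = {}
        # Assumption: patterns with no meta words (such as "*") are
        # treated as candidates for every query.
        self.always_check = []
        for pattern in patterns:
            words = [w for w in _GLOB_SYNTAX.split(pattern) if w]
            if not words:
                self.always_check.append(pattern)
            for word in words:
                self.meta_to_patterns.setdefault(word, []).append(pattern)

    def get(self, text):
        # QUERYING: a plain substring scan stands in for Aho-Corasick here.
        candidates = list(self.always_check)
        for word, patterns in self.meta_to_patterns.items():
            if word in text:
                candidates.extend(patterns)
        # Check candidates against the string, then drop duplicates.
        matches = [p for p in candidates if fnmatchcase(text, p)]
        return sorted(set(matches))


p = ToyParaglob(["*", "d?g", "*og", "d?", "d[!wl]g"])
print(p.get("dog"))  # ['*', '*og', 'd?g', 'd[!wl]g']
```

The substring scan costs time proportional to the number of meta words on every query, which is exactly the cost the Aho-Corasick structure avoids in the real data structure.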

Implementation


With the algorithm designed, paraglob’s actual implementation is fairly straightforward, but with a couple of important nuances. The first lies in the fact that for a given set of patterns there is a non-zero chance that one meta word will be associated with multiple patterns. Consider for example a small pattern set [*mischiev[!o]us*, *mischevous*, *.us*, *.gov*] which might flag mischievous typosquatting and government-related URLs. Already this pattern set has one meta word (us) mapping to two quite different patterns.

As a result, paraglob can’t use a standard map structure which only allows for a single value for every key. The obvious solution to this is to use some sort of multimap, but in practice this proved to be unacceptably slow. Using a multimap slowed down paraglob by as much as a factor of 10 as opposed to an implementation with a standard map structure that ignores the above issue.

In order to achieve the performance offered by the latter, and still handle the association of multiple patterns with one meta word, paraglob uses a custom “ParaglobNode” class that stores a list of patterns and is associated with a meta word in a map. ParaglobNodes also contain functionality to quickly merge the patterns they contain that match a string into an input vector. This greatly increases the speed at which paraglob is able to find patterns for an input string.
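
To make the node idea concrete, here is a small Python sketch. The names ToyNode and merge_matches are hypothetical and do not mirror the real ParaglobNode interface; the meta word assignments are written out by hand to follow the example in the text, and Python's fnmatch stands in for paraglob's glob matcher.

```python
from fnmatch import fnmatchcase


class ToyNode:
    """One map value: a meta word plus every pattern associated with it."""

    def __init__(self, meta_word):
        self.meta_word = meta_word
        self.patterns = []

    def add_pattern(self, pattern):
        self.patterns.append(pattern)

    def merge_matches(self, text, out):
        """Append every stored pattern that matches `text` to `out`."""
        out.extend(p for p in self.patterns if fnmatchcase(text, p))


# A standard one-value-per-key map, with nodes as the values.
nodes = {}
for word, pattern in [("us", "*mischiev[!o]us*"), ("us", "*.us*"), ("gov", "*.gov*")]:
    nodes.setdefault(word, ToyNode(word)).add_pattern(pattern)

found = []
nodes["us"].merge_matches("example.us", found)
print(found)  # ['*.us*']
```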

The second important nuance lies in how paraglob handles duplicate patterns. Using the same example pattern set as above, a query for “mischievious-url.uk” contains the meta words us and mischiev. Mapping those to their respective patterns, we get [*mischiev[!o]us*, *.us*] from us and [*mischiev[!o]us*] from mischiev. Initially it seems like we should keep these in a set so as to prevent checking the same pattern twice. As it turns out though, maintaining a set internally is much more expensive than just checking duplicate patterns and using vectors internally. The result of this is that a paraglob doesn’t remove any duplicates until the last step, when the vector of matching patterns is at its smallest.
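
A short Python sketch of that ordering: collect candidates in a plain list, duplicates included, check each one, and deduplicate only the final matches. The meta word map is written out by hand to mirror the example above, the query string "mischievaus.us" is a made-up input chosen so that both patterns match, and Python's fnmatch stands in for paraglob's glob matcher.

```python
from fnmatch import fnmatchcase

# Meta word -> patterns, written out by hand to mirror the example in the text.
meta_to_patterns = {
    "mischiev": ["*mischiev[!o]us*"],
    "us": ["*mischiev[!o]us*", "*.us*"],
}


def match(text):
    # Gather candidates in a list; the same pattern may appear twice.
    candidates = []
    for word, patterns in meta_to_patterns.items():
        if word in text:
            candidates.extend(patterns)
    # Check every candidate, duplicates and all.
    matches = [p for p in candidates if fnmatchcase(text, p)]
    # Deduplicate only at the end, when the list is at its smallest.
    return candidates, sorted(set(matches))


candidates, matches = match("mischievaus.us")
print(len(candidates))  # 3
print(matches)          # ['*.us*', '*mischiev[!o]us*']
```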

Inside Zeek


Paraglob is integrated with Zeek and provides a simple API inside of its scripting language. In Zeek, paraglob is implemented as an OpaqueType and its syntax closely follows other similar constructs inside Zeek. A paraglob can only be instantiated once from a vector of patterns and then only supports get operations, which return a vector of all patterns matching an input string. The syntax is as follows:

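# Build the paraglob once from a vector of patterns.
local v = vector("*", "d?g", "*og", "d?", "d[!wl]g");
local p = paraglob_init(v);
# Print every pattern that matches the input string.
print paraglob_get(p, "dog");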

Out:

[*, *og, d?g, d[!wl]g]

Paraglob also supports serialization, copy, and unserialization operations inside Zeek. This means that a paraglob can be sent to separate processes using Broker. Keep in mind though that copying a paraglob requires that it be recompiled and for very large paraglobs this can be an expensive operation.

While the absence of an add operation might seem strange, it stems from constraints that emerge in paraglob’s implementation. Adding a pattern to a paraglob that is already compiled requires that the paraglob be re-compiled because the Aho-Corasick tree has to be rebuilt. As a result, adding a pattern to a compiled paraglob takes the same amount of time as building a new paraglob from a vector of patterns.

While it seems reasonable that paraglob support both add and compile operations to get around this, I thought this was more likely to confuse than to provide much real benefit. People using paraglob without knowing about its performance characteristics might attempt to add to the paraglob in a loop, or forget to compile it, resulting in unexpectedly slow performance or errors.

With that said though, I certainly see an argument for extending the paraglob API to include add and compile operations. For use cases where there is an updating pattern set it would remove the need to keep track of a vector of patterns and a paraglob because the paraglob would maintain the vector of patterns itself. Under the hood paraglob already supports add and compile operations so adding those to Zeek would be as simple as extending ParaglobVal slightly and adding two functions to bro.bif.
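
For readers wondering what an add and compile interface might look like, here is a hypothetical Python sketch. None of these names exist in Zeek; the class and its methods are assumptions made purely for illustration, and the "compile" step is a stand-in for rebuilding the Aho-Corasick structure.

```python
from fnmatch import fnmatchcase


class AddCompileParaglob:
    """Hypothetical add/compile interface; not Zeek's actual API."""

    def __init__(self):
        self.patterns = []     # the paraglob owns the pattern vector
        self.compiled = False
        self._index = ()

    def add(self, pattern):
        self.patterns.append(pattern)
        self.compiled = False  # any add invalidates the compiled structure

    def compile(self):
        # Stand-in for rebuilding the Aho-Corasick structure from scratch.
        self._index = tuple(self.patterns)
        self.compiled = True

    def get(self, text):
        if not self.compiled:
            raise RuntimeError("compile() must be called after add()")
        return [p for p in self._index if fnmatchcase(text, p)]


p = AddCompileParaglob()
for pattern in ["*og", "d?"]:
    p.add(pattern)  # cheap: only appends to the vector
p.compile()         # expensive: done once, after all adds
print(p.get("dog"))  # ['*og']
```

The sketch also shows the failure mode mentioned above: calling get after an add without compiling raises an error instead of silently returning stale results.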


Next Steps


A paraglob’s state is defined completely by the patterns inside of it. Paraglobs hold no internal state between calls, nor do they make any updates to their internal Aho-Corasick structure unless a new pattern is added. Presently, their serialization function takes advantage of this and only serializes the vector of patterns contained inside a paraglob. For unserializing, a new paraglob is built from that serialized vector of patterns, and its Aho-Corasick structure is recompiled. This recompilation is expensive though, and can take as long as 10 seconds for very large pattern sets.
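
The current approach can be sketched in Python as follows. This is an illustration under assumptions, not the Zeek or Broker serialization code: JSON is used as an arbitrary wire format, the class name is made up, and the "compile" step is a stand-in for the Aho-Corasick build.

```python
import json
from fnmatch import fnmatchcase


class SerializableParaglob:
    def __init__(self, patterns):
        self.patterns = list(patterns)
        self._compile()

    def _compile(self):
        # Stand-in for the expensive Aho-Corasick build.
        self._index = tuple(self.patterns)

    def get(self, text):
        return [p for p in self._index if fnmatchcase(text, p)]

    def serialize(self):
        # Only the pattern vector is written out, not the compiled structure.
        return json.dumps(self.patterns)

    @classmethod
    def unserialize(cls, data):
        # Rebuilding from the patterns triggers a full recompile.
        return cls(json.loads(data))


original = SerializableParaglob(["*og", "d?"])
copy = SerializableParaglob.unserialize(original.serialize())
print(copy.get("dog"))  # ['*og']
```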

Ideally, a paraglob could be serialized in such a way that the recompilation step is not needed. There exists some serialization code inside of the Boost C++ Libraries that might be useful in doing this, but due to how complicated the Aho-Corasick trie becomes when it contains a fair number of patterns, serializing this would likely take a significant effort. Working out a clean way to serialize a paraglob properly, though, would potentially result in a serious increase in its usefulness for distributing frequently changing pattern sets.


Finally…


A huge thank you to Kamiar Kanani for his excellent Multifast Aho-Corasick implementation, which he allowed us to use under the BSD license for the Zeek project. Without such a well done string searching algorithm underpinning everything this would have been a much more difficult data structure to implement.


Contributed by: Zeke Medley - Website


Tuesday, June 11, 2019

Open Source Zeek Leadership Team Meeting Minutes - 31 May 2019



The open source Zeek project Leadership Team (LT) is made up of contributors from multiple organizations throughout the community. The LT acts as both a technical steering committee and governance body. You can find out more about the LT on the team page of the website.

Below are the notes from the LT meeting held on 31 May 2019.


Zeek.org Leadership Team Members (Bold indicates attendance)

  • Keith Lehigh (Chair), Indiana University
  • Johanna Amann, International Computer Science Institute/Corelight/Lawrence Berkeley National Laboratory
  • Seth Hall, Corelight
  • Vern Paxson, Corelight & University of California at Berkeley
  • Michal Purzynski, Mozilla Foundation
  • Aashish Sharma, Lawrence Berkeley Lab
  • Adam Slagell, ESnet
  • Robin Sommer, Corelight

  • Amber Graner*, Corelight, Director of Community for the Open Source Zeek Community
         *not a member

Agenda

  • Trademark Discussion  (Amber)
  • Keynotes  (Keith)
  • Zeek Package Contest (Amber)
  • Analytics Discussion Scheduling (Keith)

Minutes

  • Trademark Discussion - The LT discussed the current Name and Logo Usage Statement - https://www.zeek.org/documentation/marks.html. Out of the discussion came the following action items to look into:
    • Create a Reciprocal Logo Usage Agreement
    • Update the Marks Usage Documentation
    • Create a standard Cease and Desist letter
  • Keynotes - LT Members will continue reaching out to potential keynote speakers for ZeekWeek 2019.
  • Zeek Package Contest - Amber brought up the Zeek Package Contest that Corelight would like to host leading up to ZeekWeek 2019. Amber to take LT feedback to the Corelight team and present the details of the program at the next LT meeting.
  • Analytics Discussion Scheduling - Keith to schedule an additional LT meeting to discuss analytics tools for the website.


Wednesday, June 5, 2019

People of Zeek Interview Series - Introducing Zeke Medley and Zeke on Zeek

Amber Graner (AG): Hi Zeke. Thank you so much for taking the time to answer my questions and let the community know who you are and what Zeek related items you’re working on.

Zeke Medley (ZM): Hi Amber :-)

AG: Zeke can you take a moment to tell people a little about yourself and what you’re doing for the Open Source Zeek Project?


ZM: I started getting interested in programming in 7th grade when I wrote a tiny rock-paper-scissors program over the summer. Since then, I’ve remained fairly interested in rock-paper-scissors, but have branched out a little bit. My first introduction to network security was probably freshman year of high school when one of my friends figured out that he could remotely open disk drives in our school’s computer labs with the command line and I wrote him a little script to do it for all the computers in a lab. These days I’m a freshman studying Electrical Engineering and Computer Sciences at Berkeley and also working in a makerspace on campus called the Invention Lab.

On the Zeek side I’m wrapping up work on a data structure to match a string against a large set of patterns that Robin started a while ago and I just finished adding key-value for loops to the Zeek scripting language. Moving forward I hope to stay involved in the open source project, and we’ll see what projects I end up working on.

AG: What drew you to Zeek and how did you get involved with the project?

ZM: My name being Zeke definitely made it stand out to me, but I was actually first introduced to it when I met Christian at a career fair. He seemed like a really nice guy and the whole project was right in line with my interests. I made my first pull request adding some basic string functions to the language and the rest is recent history.

AG: What’s the most interesting thing you’ve learned about Zeek so far?

ZM: At first I was pretty intimidated by just how big Zeek is. There is a lot going on and it's a fairly complex program. The more I’ve learned about it though, the better I think it's designed. Zeek is very extensible. Once you get the hang of it, .bif files make adding new functionality to the language pretty fun and straightforward.

AG: Can you tell the community about the “Zeke on Zeek” series we’ll be starting soon and what they can expect to read about?

ZM: “Zeke on Zeek” is a series of blog posts we’ve been talking about pretty much since I got started that I hope will offer some sort of roadmap for people getting started working on Zeek. Zeek is a big project and putting together how it all works can be pretty challenging at times, so I’ll be laying out my experience in the hope that it can help other people interested in contributing to the project.

AG: For those who are thinking about interning for the first time, can you share some things you’ve learned or are learning about how to balance your time between school, your internship, and personal projects?

ZM: I know it sounds silly, but I genuinely enjoy the vast majority of what I do. School can be really challenging at times and making anything, be it a data structure or drone, seems to be more of a process of learning from repeated failure than actually creating anything that works, but I think there is something profound about that. In my (albeit rather limited) experience the more comfortable I become with failure the easier things get.

AG: Is there anything that you’d like to share about yourself or Zeek that I haven’t asked you about?

ZM: I’ve been really floored by Zeek and its community because insofar as I can tell they seem to be genuinely out to do good for the world. Not only is the whole project open source, it's also out to help solve pressing problems we have with network security these days.

