Friday, July 19, 2019

Zeke on Zeek: Working With Open-Source Zeek: Adding a Key-value For-Loop

By Zach Medley

Getting started working on Zeek can be daunting because of the sheer size of the repository. Zeek is reasonably designed, but it is big, and even a well-organized codebase of that size can be a lot to handle. This blog post walks through how I added Zeek’s key-value for loop, in the hope that it makes it easier for future Zeek developers to get started.

Zeek, formerly Bro, is an open-source network security monitoring tool that transforms raw traffic into rich logs, extracted files, and custom insights via its own Turing-complete scripting language. It’s all open source, and developed on GitHub with its community.

Defining the Problem


Before the addition of a key-value for loop, you could iterate over the items in a Zeek container with a standard range-based for loop of the form for ( item in container ).



However, when looping over a table you often want both the key and the value, and getting the value required a separate lookup, such as t[key], inside the loop body.



This is less than ideal for both ergonomic and efficiency reasons. Under the hood, when Zeek iterates over a table it already retrieves the value that corresponds to each key, which makes the second lookup unnecessary, as Zeek user Jon pointed out.




As for the syntax, Zeek’s tables can be indexed by tuples. The existing for loop supported iteration over such tables by wrapping the key variables in brackets, which unpacks the tuple, as in for ( [a, b] in t ).



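Christian suggested that we extend this tuple unpacking for use with key-value for loops, giving a syntax of the form for ( key, value in t ), or for ( [a, b], value in t ) when the table is indexed by tuples.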
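To make the efficiency argument above concrete, here is a small C++ analogy. It is only a sketch: std::map stands in for a Zeek table, and the names Table, sum_with_lookup, and sum_key_value are made up for illustration and are not part of Zeek. The first function mirrors the old pattern (iterate over keys, then look each value up again), while the second mirrors a key-value loop.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for a Zeek table[string] of count.
using Table = std::map<std::string, int>;

// Key-only iteration: the loop hands us the key, so we look the value up again.
int sum_with_lookup(const Table& t) {
    int total = 0;
    for (const auto& entry : t) {
        const std::string& key = entry.first;
        total += t.at(key);  // second lookup, although `entry` already holds the value
    }
    return total;
}

// Key-value iteration: the value arrives together with the key.
int sum_key_value(const Table& t) {
    int total = 0;
    for (const auto& entry : t)
        total += entry.second;  // no extra lookup needed
    return total;
}
```

Both functions return the same result; the second simply avoids one lookup per element, which is the saving the key-value for loop is after.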


Writing Tests


The testing framework Zeek uses is called btest and tests written using it are commonly called “btests.” Zeek's btests live in the testing/btest/ directory. Once you get the hang of them, they are pretty straightforward, but at first glance they can be a little confusing.

A btest usually consists of a test and a baseline. Btest works by running your test and comparing its output to a known baseline. A difference between the output and the baseline results in a failed test. In addition to cloning Zeek, you’ll need to install btest separately.

We suggest installing the development version of btest. This gives you access to a more up-to-date btest that the master version of Zeek may depend on. After cloning Zeek, move to the directory you cloned it into and run:

     pip install -e aux/btest/

With btest installed, we can begin to write our tests. Zeek already has tests that cover for-loops in testing/btest/language/for.bro, so modifying that file would be fine, but I chose to add a separate test file called key-value-for.bro. I wrote a couple of tests for key-value for-loops and added one for iterating over tables with more than one index value, because there wasn’t a test for that yet. My tests for the key-value for-loop look like this:




Note: It's important that your test has the # @TEST-EXEC … line at the top. If it doesn’t, btest won't know what command to use to run the test. In this case, our btest runs Zeek on the script that follows those lines, and a subsequent diff compares the output to our baseline of expected output.

With the test written, you’ll now have to add a baseline so that btest knows what the desired output should be. There is no single best way to create a baseline; many approaches work well. As long as you end up with a working test, whichever way you prefer is likely fine.

The easiest way to create a baseline for a simple btest is to temporarily replace the test script with an ad-hoc script that produces the desired output. For the test above, we might replace it with some print statements. Then run the test with the -U option, which records the test's output as the new baseline (the lowercase -u option instead asks before updating). Once that’s done, don't forget to change the script back to the one you want to test.

For more complicated tests, though, this ad-hoc method can get troublesome. Here, Christian suggests running the real test, letting it fail, then copying the “out” file it creates over to the baseline directory.

More or less in line with Christian’s suggestion, I created my baseline by hand in the testing/btest/Baseline/ directory. There I created a new folder named <the btest folder your test is in>.<the name of your test file, without its extension>. For example, my test file was named key-value-for.bro and lived in the btest/language folder, so I added a folder called language.key-value-for to the btest/Baseline folder. Inside your new folder, add a file called out and write the expected output of your test into it. My out file looks like this:



Now we can run our test and see if it fails. To run the test, first build and install Zeek by running:

     ./configure

     make

     make install


Then, change back to the testing/btest directory and run:

     btest -d language/key-value-for.bro

Writing Code


Adding new language functionality in Zeek can be done in a couple of simple steps:

1. Modify parse.y so that the new syntax is recognized and handled properly.

2. Write the underlying C++ code to make it all work.

We’ll start by writing the code to parse the new for-loop.

Parsing


Zeek uses lex and yacc (in practice, flex and Bison) to generate its parser. The part that we’re concerned with can be found in src/parse.y. Specifically, we’re interested in the part that parses the for statement, underneath for_head:



I’ll walk through this code to give an overview of how it works, and then show the new parsing rules for a key-value for-loop.

TOK_FOR '(' TOK_ID TOK_IN expr ')'

This is the grammar rule itself: it describes the syntax that the code following it handles. Each symbol in the rule can be referred to by its position, with TOK_FOR at position 1 and ')' at position 6.

set_location(@1, @6);

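When a Zeek script is parsed, the resulting objects can be associated with a source location. Here @1 and @6 are the locations of the first and sixth symbols of the rule, so the location covers the whole for-loop head. For more information on the utility of this, see the Bison documentation on tracking locations. For a little more on how a location is represented, see src/Obj.h.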

ID* loop_var = lookup_ID($3, current_module.c_str());

In this case, $3 refers to the value of TOK_ID, which is the name of the loop variable. Here we get loop_var’s previous definition if it already exists in the current module.





This is the meat of the parse phase. Here, if loop_var already has a definition, we make sure that it is not a global variable. Otherwise, we initialize it.
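The decision described above can be sketched in a few lines of standalone C++. This is not Zeek's parser code: Ident, Scope, and resolve_loop_var are hypothetical names, and the symbol table is reduced to a std::map, purely to show the "reuse a local, reject a global, otherwise create a new local" logic.

```cpp
#include <map>
#include <string>

// Hypothetical identifier record; not Zeek's ID class.
struct Ident {
    std::string name;
    bool is_global;
};

// Hypothetical symbol table keyed by identifier name.
using Scope = std::map<std::string, Ident>;

// Returns the identifier to use as the loop variable, or nullptr (with
// `error` set) if the name already belongs to a global.
const Ident* resolve_loop_var(Scope& scope, const std::string& name,
                              std::string& error) {
    auto it = scope.find(name);
    if (it != scope.end()) {
        if (it->second.is_global) {
            error = "global variable used in for loop: " + name;
            return nullptr;
        }
        return &it->second;  // already defined as a local: reuse it
    }
    // Not defined yet: install a fresh local variable.
    auto inserted = scope.emplace(name, Ident{name, false});
    return &inserted.first->second;
}
```

Calling resolve_loop_var twice with the same new name returns the same local both times, while a name that is already a global produces an error instead of an identifier.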

$$ = new ForStmt(loop_vars, $5);

Finally, we build a new for-statement from the loop variables and $5, the expression we’re iterating over, and assign it to $$, the result of the rule.

My implementation follows the basic for-loop’s parsing procedure very closely and calls an alternate version of the constructor that I’ll discuss next.




Core Functionality


In order to preserve as much of the original for-loop’s functionality as possible, I opted to write an alternate constructor for the for-loop that included a variable for values to be stored in as the loop moves through the table. The constructor first calls the regular for-loop constructor on the loop variables and expression, and then runs some additional code to verify the type of the value variable.
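The constructor arrangement described above maps naturally onto a C++ delegating constructor. The sketch below is an illustration under simplifying assumptions, not the code in Zeek: ForLoop, TableType, and Var are hypothetical types, types are represented as plain strings, and I assume that an undeclared value variable simply adopts the table's yield type.

```cpp
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified types; Zeek's real classes are richer.
struct TableType {
    std::vector<std::string> index_types;  // one entry per index component
    std::string yield_type;                // type of the stored values
};

struct Var {
    std::string name;
    std::string type;  // empty means "not declared yet"
};

class ForLoop {
public:
    // Regular for-loop: only checks the key variables.
    ForLoop(std::vector<Var> keys, TableType table)
        : keys_(std::move(keys)), table_(std::move(table)) {
        if (keys_.size() != table_.index_types.size())
            throw std::invalid_argument("wrong number of loop variables");
    }

    // Key-value for-loop: reuse the checks above, then handle the value variable.
    ForLoop(std::vector<Var> keys, TableType table, Var value)
        : ForLoop(std::move(keys), std::move(table)) {
        if (value.type.empty())
            value.type = table_.yield_type;  // new variable: adopt the yield type
        else if (value.type != table_.yield_type)
            throw std::invalid_argument("value variable type mismatch");
        value_ = std::move(value);
        has_value_ = true;
    }

    bool has_value() const { return has_value_; }
    const Var& value() const { return value_; }

private:
    std::vector<Var> keys_;
    TableType table_;
    Var value_;
    bool has_value_ = false;
};
```

The appeal of this shape is that the key-only path stays untouched: every check the regular loop performs also runs for the key-value loop, and the value handling lives in one place.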

The most interesting part of the for-loop is the actual looping. This is done in the for-loop's DoExec method (ForStmt::DoExec) in src/Stmt.cc.




We’re only interested in the part of the for-loop that deals with looping over tables, because tables are the only data type supported by key-value for-loops. This code is mostly self-explanatory, with the exception of the usage of Ref() and Unref().

Zeek uses reference counting under the hood to clean up objects when they’re no longer being used. If you’re familiar with modern C++, this is conceptually similar to how shared_ptr works, except that the count is stored in the object itself. Each object keeps track of how many references it has; if that number drops to zero, Zeek cleans the object up. Whenever we set an element in a frame, we need to call Ref() on it. This increases the object's reference count, indicating that the frame still needs the value until some time in the future when Unref() is called on it.
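For readers who have not met this style of memory management before, here is a minimal, self-contained sketch of intrusive reference counting. Only the function names Ref() and Unref() come from the text above; the classes Counted and Slot and the helper store_shared are hypothetical and much simpler than anything in Zeek.

```cpp
#include <cassert>

// Hypothetical intrusively counted base class; not Zeek's actual object class.
class Counted {
public:
    virtual ~Counted() = default;
    int RefCount() const { return ref_count_; }

private:
    friend void Ref(Counted* o);
    friend void Unref(Counted* o);
    int ref_count_ = 1;  // whoever creates the object holds the first reference
};

void Ref(Counted* o) {
    assert(o != nullptr);
    ++o->ref_count_;
}

void Unref(Counted* o) {
    if (o != nullptr && --o->ref_count_ == 0)
        delete o;  // last reference gone: clean up
}

// Hypothetical one-slot "frame" that owns one reference to whatever it stores.
class Slot {
public:
    Slot() = default;
    Slot(const Slot&) = delete;
    Slot& operator=(const Slot&) = delete;
    ~Slot() { Unref(value_); }

    // Takes over one reference from the caller and releases the old value.
    void Set(Counted* v) {
        Unref(value_);
        value_ = v;
    }

    Counted* Get() const { return value_; }

private:
    Counted* value_ = nullptr;
};

// Stores `v` in the slot while the caller keeps its own reference.
void store_shared(Slot& slot, Counted* v) {
    Ref(v);  // one more reference, owned by the slot from now on
    slot.Set(v);
}
```

In this sketch, forgetting the Ref(v) in store_shared would leave the slot and the caller sharing a single reference, so whichever of them calls Unref() first would delete the object out from under the other. That is the kind of bug the next paragraph warns about.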

Keeping track of reference counts in Zeek can be difficult to get the hang of, and mistakes lead to bugs that are hard to track down. Take care when using a value after passing it elsewhere; if you get a segfault, a missing Ref() is often the cause. Debuggers like gdb and tools like valgrind can help track down what it was that got deleted.

Conclusion


The addition of key-value for loops to Zeek makes iterating over a table simpler and more performant, because the value no longer has to be looked up separately on every iteration.



When possible, key-value for loops should be preferred to regular loops over tables.

If you’re interested in contributing to Zeek there is no bar to entry. For C and C++ people, the Zeek core is a great place to get your feet wet developing a scripting language. You can also get involved just writing Zeek. Much of Zeek is written in Zeek. Even if you don’t program much, I wrote the README so I’m sure it's got a couple spelling and grammar errors.
No matter how you do it, working on Zeek can be an incredibly rewarding experience. It's fun, challenging, educational, and keeps the world’s networks safe.


Helpful Links and information:

Getting Involved: If you would like to be part of the Open Source Zeek Community and contribute to the success of the project, please sign up for our mailing lists, join our IRC channel, come to our events, and follow the blog and/or Twitter feed. If you’re writing scripts or plugins for Zeek, we would love to hear from you! If you can’t figure out what your next step should be, just reach out. Together we can find a place for you to actively contribute and be a part of this growing community.

About Zeek (formerly Bro): Zeek is a powerful network analysis framework that is much different from the typical IDS you may know. https://www.zeek.org/
