Luatime

Generating arbitrary test data

27 January 2015 2:30 PM (pflua | quickcheck | testing | Igalia)

Most tests are useless without data, and the data they need can be fairly complicated. Generating a lot of data by hand sounds tedious, and it would forfeit much of the advantage of generating tests: the flexibility and speed of having a large number of test cases would disappear. So, what does it look like to generate data for tests? Here is an example of generating pflua's AST, one of its intermediate representations, using a functional grammar; the complete code is available. The key thing to notice is that it is actually fairly easy to do: it takes less than a page of straightforward code, mainly choosing randomly between alternatives.

An AST is generated by running Logical(). The simplest alternatives are to generate a one-word AST, which can be 'true', 'false', or 'fail'. Alternatively, a conditional (with 'if') or a comparison (with a relational operator like '==') can be generated. But this text explanation is already longer than the code; here is the implementation of 'Logical':

function Logical()
   return choose({ Conditional, Comparison, True, False, Fail })()
end
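The 'choose' helper is not shown in this excerpt; a minimal sketch of it (an assumption about the real implementation, which lives in the complete code linked above) simply picks a random element of a table:

```lua
-- Hypothetical sketch of the choose() helper used throughout:
-- pick one element of a table, uniformly at random.
local function choose(alternatives)
   return alternatives[math.random(#alternatives)]
end

-- For example, picking one of the one-word AST constructors and calling it:
local function True() return { 'true' } end
local function Fail() return { 'fail' } end
local node = choose({ True, Fail })()
assert(node[1] == 'true' or node[1] == 'fail')
```

Because choose() can return functions as easily as strings or numbers, the same helper works for picking operators, constants, and whole sub-grammar constructors.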

To handle complications, like avoiding division by zero or accessing beyond the bounds of a packet, a Lua table ('db') is also passed around to the arithmetic operations, accumulating the assertions each subexpression needs. 'Comparison' then unwraps those assertions, generating code that tests whether they are violated and fails if they are.

With the general concepts explained, without further ado, here is the actual code to generate random ASTs for tests.

function True() return { 'true' } end
function False() return { 'false' } end
function Fail() return { 'fail' } end
function ComparisonOp() return choose({ '<', '>' }) end
function BinaryOp() return choose({ '+', '-', '/' }) end
function UnaryOp()
   return choose({ 'uint32', 'int32', 'ntohs', 'ntohl' })
end
-- Boundary numbers are often particularly interesting; test them often
function Number()
   if math.random() < 0.2 then
      return math.random(-2^31, 2^32 - 1)
   else
      return choose({ 0, 1, -2^31, 2^32 - 1, 2^31 - 1 })
   end
end
function Len() return 'len' end
function Unary(db) return { UnaryOp(), Arithmetic(db) } end
function Binary(db)
   local op, lhs, rhs = BinaryOp(), Arithmetic(db), Arithmetic(db)
   if op == '/' then table.insert(db, { '!=', rhs, 0 }) end
   return { op, lhs, rhs }
end
function PacketAccess(db)
   local pkt_access_size = choose({ 1, 2, 4 })
   local position = { 'uint32', Arithmetic(db) }
   table.insert(db, { '>=', 'len', { '+', position, pkt_access_size } })
   return { '[]', position, pkt_access_size }
end
function Arithmetic(db)
   return choose({ Unary, Binary, Number, Len, PacketAccess })(db)
end
function Comparison()
   local asserts = {}
   local expr = { ComparisonOp(), Arithmetic(asserts), Arithmetic(asserts) }
   while #asserts > 0 do
      expr = { 'if', table.remove(asserts), expr, { 'fail' } }
   end
   return expr
end
function Conditional() return { 'if', Logical(), Logical(), Logical() } end
function Logical()
   return choose({ Conditional, Comparison, True, False, Fail })()
end

The above approach is not perfect; it generates quite large test cases. 'choose' weights options equally, though Number() demonstrates how bias can be introduced if some cases should be tested more often than others. Nonetheless, it's a real-world example of something good enough to be useful, while being easy and flexible at the same time.

Improving Pflua's correctness with property-based checking

27 January 2015 2:30 PM (pflua | quickcheck | testing | Igalia)

Creating a property-based tester and fixing several bugs - in one afternoon

Pflua, Igalia's implementation of Libpcap's filter language, is fast, and works very well. Like most software, it contains a few bugs, which unusual circumstances can trigger. Pflua is largely compatible with a subset of Libpcap, but has a few minor differences. Igalia is actively fixing bugs in Pflua, and increasing Libpcap compatibility, so that our users have the best experience possible: we like to fix bugs before anyone else even finds them. Part of the way we are improving Pflua is with property-based testing.

Writing a lot of tests by hand is slow, boring, inflexible, and insufficient. It makes it harder to change the underlying code and try new things, and people tend not to think of the full range of edge cases. This is an even more serious problem when the same person writes both the code and the test cases, as is usually the case with Pflua - if you miss an edge case while writing code, it is very easy to miss the same one while writing tests. Luckily, there is an alternative to writing hundreds of thousands of unit tests by hand. It is called "property-based testing": you specify a "property" that should always hold, and the tool generates random test cases to try to falsify that property.

I love QuickCheck. Using it is one of the most enjoyable things about programming in Haskell. It is a fantastic tool for generating test cases that people tend not to think of. Pflua is written in Lua, not Haskell, so one afternoon, Andy Wingo and I sat down and wrote a simple program inspired by QuickCheck. Our program is not a port or clone of QuickCheck; it is a much simpler tool. Originally, it tested only one property - that the results of running a filter on a packet were the same before and after optimizing the Pflua IR for that filter. It is a very simple property, easily expressed in one line of code, but we found several flaws in Pflua, ranging from trivial to fairly important.

The simplest bug was due to not emitting spaces around binary operators. What is 8--2? In Lua, 8 - -2 is 10, as expected, but -- is a comment, so 8--2 is 8. In C11 or C++, it is like writing 8//2. After adding a space around each binary operator, that bug was eliminated. It was a trivial bug, but it produced silently incorrect results, and I am glad our users will never be bitten by it.
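A standalone snippet (not pflua's generated code) makes the difference concrete:

```lua
-- In Lua, '--' starts a comment that runs to the end of the line,
-- so '8--2' parses as the number 8 followed by the comment '-2'.
local a = 8--2
local b = 8 - -2  -- with spaces, this is subtraction of negative two
assert(a == 8)
assert(b == 10)
```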

How does Pflua's property-based testing work? There are a few steps. Pflua-test loops up to the number of times the test is to be run. Each time, it generates random data for the test case, and then checks whether the test passes. Concretely, for testing whether optimized and unoptimized IR give the same results, it generates an arbitrary valid IR expression (with the function "Logical") and applies it to a random packet from a pcap dump. The test runs the optimizer on the IR, and compares the results of applying the optimized and unoptimized IR to the chosen packet. Originally, we did this in the main loop; later, we generalized the tool to take files that describe properties. This testing is not exhaustive: there are infinitely many IRs (and no effective way to collapse the state space), and the test could pass on the concrete packet it is run against but fail on other packets. Nonetheless, it is useful.
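The shape of that loop can be sketched as follows; the generator, evaluator, and "optimizer" below are trivial stand-ins invented for illustration, not pflua's real functions - only the structure of the check is faithful to the description above.

```lua
-- Minimal sketch of a property-based test loop in the spirit of
-- pflua-test. gen_expr, eval, and optimize are toy stand-ins.
local function gen_expr()                    -- stand-in for Logical()
   local ops = { '+', '-' }
   return { ops[math.random(#ops)], math.random(0, 10), math.random(0, 10) }
end

local function eval(e)                       -- stand-in for running the IR
   if e[1] == '+' then return e[2] + e[3] else return e[2] - e[3] end
end

local function optimize(e)                   -- stand-in "optimizer" (identity)
   return { e[1], e[2], e[3] }
end

local function check_property(iterations)
   for _ = 1, iterations do
      local e = gen_expr()
      -- the property: optimizing must not change the result
      if eval(e) ~= eval(optimize(e)) then
         error("property falsified for a generated expression")
      end
   end
   return true
end

assert(check_property(1000))
```

In the real tool, the property is the one-liner described above - optimized and unoptimized IR applied to the same packet must agree - and a falsifying test case is printed so the bug can be reproduced.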

What is the idea behind the function "Logical"? Lua does not have an equivalent of Haskell typeclasses or 'instance Arbitrary', so we captured part of the spirit of those tools by setting up a functional grammar to generate arbitrary valid IR strings. The details are not essential, but I think it is worth sharing how little code it takes to gain this much flexibility. See the next post for the full implementation of the generator for the IR's grammar, in Lua.

So, what happened when we ran the code, testing the property that an IR string applied to a packet gives the same result before and after being optimized? Aside from the aforementioned comment/binary operator bug, the tool found a range-folding bug, a bug stemming from not setting the range of len, and a bug where Pflua generated invalid Lua (by returning, regardless of whether it was at the end of a block), as well as an invalid optimization involving ntohs and ntohl, and an internal crash with NaN.

Testing always takes time and effort, but property-based testing is often a lot more flexible and effective than writing tests by hand, and can be surprisingly quick to start using. Pflua's QuickCheck-inspired tool is still quite new and undergoing significant changes, but turning up half a dozen bugs in an afternoon was a good start. As usual, many bugs are trivial to fix as soon as their existence is known; several of the bugs were fixed within minutes of being found.

A tale of sunk allocation

31 October 2014 10:51 PM (Igalia | pflua | optimization | IR | LuaJIT | performance | benchmarks)

Pflua is fast; Igalia wants it to be faster. When I joined Igalia, one of the big open questions was why some very similar workloads had extremely different speeds; matching a packet dump against a matching or non-matching host IP address could make the speed vary by more than a factor of 3! There was already a belief that allocation/trace/JIT interactions were a factor in this huge performance discrepancy, and it needed more investigation.

As a first step, I tracked down and removed some questionable allocations, which showed up in LuaJIT's trace dumps, but were not obvious from the source code. Removing unnecessary allocations from the critical path generally speeds up code, and LuaJIT comes with an option to dump its bytecode, internal representation, and the resulting assembly code, which gave some useful clues about where to start.

What were the results of removing the unnecessary allocations? They were mixed, but they included double-digit performance improvements for some filters on some pcap dumps. "Portrange 0-6000" got 33% faster when benchmarked on the 1gb dump, and 96% faster on wingolog, nearly doubling in speed. The results were achieved by sinking all of the CNEWI allocations within the filter function of pflua-match - the performance gain came from a small set of changes to one function. So, what was that change, and how was it made?

The first step was to see the traces generated by running pflua-match.

luajit -jdump=+rs ./pflua-match ~/path/wingolog.org.pcap tcp > trace1

The thing to look for is allocations that show up in LuaJIT's SSA IR. They are documented at http://wiki.luajit.org/SSA-IR-2.0#Allocations. Specifically, the important ones are allocations that are not "sunk" - that is, that have not already been optimized away (http://wiki.luajit.org/Allocation-Sinking-Optimization has more details about the allocations LuaJIT can sink). In practice, for the critical path of pflua-match, these were all CNEWI.
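One quick way to surface candidates (a workflow assumption on my part, not a pflua tool) is to filter the dump for CNEWI lines that lack the {sink} marker. Here it is run against two sample IR lines like the ones quoted later in this post:

```shell
# Two sample IR dump lines: one unsunk CNEWI (assigned register r15),
# one sunk (marked {sink}).
cat > trace1 <<'EOF'
0006 r15  + cdt CNEWI +170 0003
0007 {sink} cdt CNEWI +172 0003
EOF

# Keep only CNEWI allocations that were NOT sunk.
grep 'CNEWI' trace1 | grep -v '{sink}'
```

In a real dump the surviving lines point directly at the traces, and hence the source functions, worth investigating.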

What is CNEWI? The allocation documentation tersely says "Allocate immutable cdata" and that CNEWI "is only used for immutable scalar cdata types. It combines an allocation with an initialization." What is cdata? The FFI documentation defines cdata as "a C data object. It holds a value of the corresponding ctype." So, all of the unsunk allocations have already been tracked down to something immutable related to the FFI at this point.

Pflua compiles 'pflang' (the nameless filtering language of libpcap) to Lua; tools like pflua-match run the generated Lua code, and LuaJIT compiles hot traces in both pflua-match itself and the generated Lua. Here's the start of what the generated Lua code for the "tcp" filter above looks like:

function match(P,length)
   if not (length >= 34) then return false end
   do
      local v1 = ffi.cast("uint16_t*", P+12)[0]
      if not (v1 == 8) then goto L3 end

Here is an illustrative excerpt from an IR trace of pflua-match:

0019 p64 ADD 0018 +12
0020 {sink} cdt CNEWI +170 0019
0021 > fun EQ 0014 ffi.cast
0022 {sink} cdt CNEWI +173 0019
0023 rbp u16 XLOAD 0019
.... SNAP #3 [ ---- ---- ---- 0023 ]
0024 > int EQ 0023 +8

A few things stand out. There are two CNEWI calls, and they are both sunk. Also, the constant 12 on the first line looks awfully familiar, as does the 8 on the last line: this is the IR representation of the generated code! The 0019 at the end of later lines refers back to the line starting with 0019, rbp is a register, the +170 and +173 are an internal representation of C types, and so on. For more details, read about the IR on LuaJIT's wiki.

This trace is not directly useful for optimization, because the CNEWI calls are already sunk (shown by the {sink}, rather than a register, preceding them), but it illustrates the principles and tools involved. It also shows that raw data access can be done with sunk allocations, even when it involves the FFI.

Here is a somewhat more complex trace, showing first the LuaJIT bytecode, and then the LuaJIT SSA IR.

---- TRACE 21 start 20/8 "tcp":2
0004 . KPRI 2 1
0005 . RET1 2 2
0022 ISF 8
0023 JMP 9 => 0025
0025 ADDVN 3 3 0 ; 1
0026 MOV 0 7
0027 JMP 5 => 0003
0003 ISGE 0 1
0004 JMP 5 => 0028
0000 . . FUNCC ; ffi.meta.__lt
0005 JLOOP 5 20
---- TRACE 21 IR
0001 [8] num SLOAD #4 PI
0002 [10] num SLOAD #5 PI
0003 r14 p64 PVAL #21
0004 r13 p64 PVAL #59
0005 rbp p64 PVAL #64
0006 r15 + cdt CNEWI +170 0003
0007 {sink} cdt CNEWI +172 0003
0008 {sink} cdt CNEWI +170 0004
0009 [18] + cdt CNEWI +170 0005

There are 4 CNEWI calls; only two are sunk. The other two are ripe for optimization. The bytecode includes JLOOP, which turns out to be a big clue.

Here is the original implementation of pflua-match's filter harness:

local function filter(ptr, ptr_end, pred)
   local seen, matched = 0, 0
   while ptr < ptr_end do
      local record = ffi.cast("struct pcap_record *", ptr)
      local packet = ffi.cast("unsigned char *", record + 1)
      local ptr_next = packet + record.incl_len
      if pred(packet, record.incl_len) then
         matched = matched + 1
      end
      seen = seen + 1
      ptr = ptr_next
   end
   return seen, matched
end

It contains exactly one loop. There are fewer and fewer places that the unsunk CNEWI allocations could be hiding. The SSA IR documentation says that PVAL is a "parent value reference", which means it comes from the parent trace, and "TRACE 21 start 20/8" says that the parent trace of trace 21 is trace 20. Once again, constants provide a hint to how the code lines up: the 16 below is from local packet = ffi.cast("unsigned char *", record + 1), as 16 is the size of a pcap_record.

0015 rdi p64 ADD 0013 +16
0017 {sink} cdt CNEWI +170 0015
0018 p64 ADD 0013 +8
0019 rbp u32 XLOAD 0018
0020 xmm2 num CONV 0019 num.u32
0021 rbp + p64 ADD 0019 0015

This is further confirmation that 'filter' is the function to be looking at. The reference to a parent trace itself is shown not to be the problem, because both a sunk and an unsunk allocation do it (here are the relevant lines from trace 21):

0003 r14 p64 PVAL #21
0006 r15 + cdt CNEWI +170 0003
0007 {sink} cdt CNEWI +172 0003

LuaJIT does not do classical escape analysis; it is quite a bit more clever, and sinks allocations which classical escape analysis cannot. http://wiki.luajit.org/Allocation-Sinking-Optimization discusses at length what it does and does not do. However, the filter function has an if branch in the middle, which is enough to prevent the allocation for the updated ptr from being sunk. One workaround is to store an offset from the pointer - a plain number that can be shared between traces - and confine the pointer arithmetic to a single trace; it turns out that this is sufficient. Here is a rewritten version of the filter function in which all of the CNEWI calls are sunk:

local function filter(ptr, max_offset, pred)
   local seen, matched, offset = 0, 0, 0
   local pcap_record_size = ffi.sizeof("struct pcap_record")
   while offset < max_offset do
      local cur_ptr = ptr + offset
      local record = ffi.cast("struct pcap_record *", cur_ptr)
      local packet = ffi.cast("unsigned char *", record + 1)
      offset = offset + pcap_record_size + record.incl_len
      if pred(packet, record.incl_len) then
         matched = matched + 1
      end
      seen = seen + 1
   end
   return seen, matched
end

How did removing the allocation impact performance? The results were mixed: peak speed dropped, but so did variance, and some filters became a double-digit percentage faster. Here are the results for the 1gb pcap file that Igalia uses in some benchmarks, which show large gains on the "tcp port 5555" and "portrange 0-6000" filters, and a much smaller loss of performance on the "tcp", "ip", and "accept all" filters.

The original file:

The patched file, without CNEWI in the critical path:

Benchmarks on three other pcap dumps can be found at https://github.com/Igalia/pflua/issues/57. More information about the packet captures tested and Igalia's benchmarking of pflua in general is available at Igalia's pflua-bench repository on Github. There is also information about the workloads and the environment of the benchmarks.

There are several takeaways here. One is that examining the LuaJIT IR is extremely useful for some optimization problems. Another is that LuaJIT is extremely clever; it manages to make using the FFI quite cheap, and performs optimizations that common techniques cannot. The value of benchmarking, and how it can mislead, are both highly relevant: a change that removed allocation from a tight loop also hurt performance in several cases, apparently by a large amount in some of the benchmarks on other pcap dumps. Lastly, inefficiency can lurk where it is least expected: while ptr < ptr_end looks like extremely innocent code in a small tool script's harness, yet it was the cause of unsunk allocations in the critical path, which in turn were skewing benchmarks. What C programmer would suspect that line was at fault?