Compilation feature

Warning

This feature is fully implemented but may not be fully mature.

Overall

Construct 2.9 adds an experimental feature: compiling user made constructs into much faster (but less feature-rich) code. If you are familiar with Kaitai Struct, an alternative framework to Construct, Kaitai compiles yaml-based schemas into pure Python modules. Construct on the other hand, defines schemas in pure Python and compiles them into pure Python modules. Once you define a construct, you can use it to parse and build blobs without compilation. Compilation has only one purpose: performance.

It should be made clear that currently the compiler supports only parsing. Building and sizeof are deferred to original constructs, from which a compiled instance was made. Building support may be added in the future, depending on popularity of this feature. In that sense, perhaps the documentation should use the term “compiled parser” rather than “compiled construct”.

Requirements

Compilation feature requires Construct 2.9, preferrably the newest version to date. More importantly, you should have a test suite of your own. Construct aims to be reliable, but the compiler makes some undocumented assumptions, and generates a code that “takes shortcuts”. Since few checks are ommited by generated code, you should not use it to parse corrupted data.

Restrictions

Compiled classes only parse faster, building and sizeof defers to core classes

Sizeof is applied during compilation (not during parsing and building)

Exceptions do not include path information

Struct Sequence FocusedSeq Union LazyStruct do not support _subcons _stream context entries

Parsed hooks are not supported, ignored

Compiling schemas

Every construct (even those that do not compile) has a parameter-less compile method that returns also a construct (instance of Compiled class). It may be a good idea to compile something that is used for processing megabyte-sized data or millions of blobs. That compiled instance has parse and build methods just like the construct is was compiled from. Therefore, in your code, you can simply reassign the compiled instance over the original one.

>>> st = Struct("num" / Byte)
>>> st.parse(b"\x01")
Container(num=1)
>>> st = st.compile(filename="copyforinspection.py")
>>> st.parse(b"\x01")
Container(num=1)

Performance boost can be easily measured. This method also happens to be testing the correctness of the compiled parser, by making sure that both original and compiled instance parse into same results.

>>> print(st.benchmark(sampledata))
Timeit measurements:
parsing:           0.0000475557 sec/call
parsing compiled:  0.0000159182 sec/call
building:          0.0000591526 sec/call

Motivation

The code generated by compiler and core classes have essentially same functionality, but there is a noticable difference in performance. First half of performance boost is thanks to pre-processing, as shown in this chapter. Pre-processing means inserting constants instead of variable lookups, constants means just variables that are known at compile time. The second half is thanks to pypy. This chapter explains the performance difference by comparing Struct FormatField BytesInteger Bytes classes, including using the context. Example construct:

Struct(
    "num8" / Int8ub,
    "num24" / Int24ub,
    "data" / Bytes(this.num8),
)

Compiled parsing code:

def read_bytes(io, count):
    assert count >= 0
    data = io.read(count)
    assert len(data) == count
    return data
def parse_struct_1(io, this):
    this = Container(_ = this)
    try:
        this['num8'] = unpack('>B', read_bytes(io, 1))[0]
        this['num24'] = int.from_bytes(read_bytes(io, 3), byteorder='big', signed=False)
        this['data'] = read_bytes(io, this.num8)
    except StopIteration:
        pass
    del this['_']
    return this
def parseall(io, this):
    return parse_struct_1(io, this)
compiledschema = Compiled(None, None, parseall)

Non-compiled parsing code:

def _read_stream(stream, length):
    if length < 0:
        raise StreamError("length must be non-negative, found %s" % length)
    try:
        data = stream.read(length)
    except Exception:
        raise StreamError("stream.read() failed, requested %s bytes" % (length,))
    if len(data) != length:
        raise StreamError("could not read enough bytes, expected %d, found %d" % (length, len(data)))
    return data

class FormatField(Construct):
    def _parse(self, stream, context, path):
        data = _read_stream(stream, self.length)
        try:
            return struct.unpack(self.fmtstr, data)[0]
        except Exception:
            raise FormatFieldError("struct %r error during parsing" % self.fmtstr)

class BytesInteger(Construct):
    def _parse(self, stream, context, path):
        length = self.length(context) if callable(self.length) else self.length
        data = _read_stream(stream, length)
        if self.swapped:
            data = data[::-1]
        return bytes2integer(data, self.signed)

class Bytes(Construct):
    def _parse(self, stream, context, path):
        length = self.length(context) if callable(self.length) else self.length
        return _read_stream(stream, length)

class Renamed(Subconstruct):
    def _parse(self, stream, context, path):
        path += " -> %s" % (self.name,)
        return self.subcon._parse(stream, context, path)

class Struct(Construct):
    def _parse(self, stream, context, path):
        obj = Container()
        context = Container(_ = context)
        context._subcons = Container({sc.name:sc for sc in self.subcons if sc.name})
        for sc in self.subcons:
            try:
                subobj = sc._parse(stream, context, path)
                if sc.name:
                    obj[sc.name] = subobj
                    context[sc.name] = subobj
            except StopIteration:
                break
        return obj

There are several “shortcuts” that the compiled code does:

Function calls are relatively expensive, so an inlined expression is faster than a function returning the same exact expression. Therefore FormatField compiles into struct.unpack(…, read_bytes(io, …)) directly.

Literals like 1 and ‘>B’ are faster than object field lookup, dictionary lookup, or passing function arguments. Therefore each instance of FormatField compiles into a similar expression but with different format-strings and byte-counts inlined, usually literals.

Passing parameters to functions is slower than just referring to variables in same scope. Therefore, for example, compiled Struct creates “this” variable that is accessible to all expressions generated by subcons, as it exists in same scope, but core Struct would call subcon._parse and pass entire context as parameter value, regardless whether that subcon even uses a context (for example FormatField VarInt have no need for a context). Its similar but not exactly the same with “restream” function. The lambda in second parameter is rebounding io to a different object (a stream that gets created inside restream function). On the other hand, this is not rebounded, it exists in outer scope.

If statement (or conditional ternary operator) with two possible expressions and a condition that could be evaluated at compile-time is slower than just one or the other expression. Therefore, for example, BytesInteger does a lookup to check if field is swapped, but compiled BytesInteger simply inlines ‘big’ or ‘little’ literal. Moreover, Struct checks if each subcon has a name and then inserts a value into the context dictionary, but compiled Struct simply has an assignment or not. This shortcut also applies to most constructs, those that accept context lambdas as parameters. Generated classes do not need to check if a parameter is a constant or a lambda, because what gets emitted is either something like “1” which is a literal, or something like “this.field” which is an object lookup. Both are valid expressions and evaluate without red tape, or checks.

Looping over an iterable is slower than a block of code that accesses each item once. The reason its slower is that each iteration must fetch another item, and also check termination condition. Loop unrolling technique requires the iterable (or list rather) to be known at compile-time, which is the case with Struct and Sequence instances. Therefore, compiled Struct emits one line per subcon, but core Struct loops over its subcons.

Function calls that only defer to another function are only wasting CPU cycles. This relates specifically to Renamed class, which in compiled code emits same code as its subcon. Entire functionality of Renamed class (maintaining path information) is not supported in compiled code, where it would serve as mere subconstruct, just deferring to subcon.

Building two identical dictionaries is slower than building just one. Struct maintains two dictionaries (called obj and context) which differ only by _ key, but compiled Struct maintains only one dictionary and removes the _ key before returning it.

This expressions (not lambdas) are expensive to compute in regular code but something like “this.field” in a compiled code is merely one object field lookup. Same applies to len_ obj_ list_ expressions since they share the implementation with this expression.

Container is an implementation of so called AttrDict. It captures access to its attributes (field in this.field) and treats it as dictionary key access (this.field becomes this[“field”]). However, due to internal CPython drawbacks, capturing attribute access involves some red tape, unlike accessing keys, which is done directly. Therefore compiled Struct emits lines that assign to Container keys, not attributes.

Empirical evidence

The “shortcuts” that are described above are not much, but amount to quite a large portion of actual run-time. In fact, they amount to about a third (31%) of entire run-time. Note that this benchmark includes only pure-python compile-time optimisations.

Notice that results are in microseconds (10**-6).

-------------------------------- benchmark: 158 tests --------------------------------
Name (time in us)                                  Min                StdDev
--------------------------------------------------------------------------------------
test_class_array_parse                        284.7820 (74.05)       31.0403 (118.46)
test_class_array_parse_compiled                73.6430 (19.15)       10.7624 (41.07)
test_class_greedyrange_parse                  325.6610 (84.67)       31.8383 (121.50)
test_class_greedyrange_parse_compiled         300.9270 (78.24)       24.0149 (91.65)
test_class_repeatuntil_parse                   10.2730 (2.67)         0.8322 (3.18)
test_class_repeatuntil_parse_compiled           7.3020 (1.90)         1.3155 (5.02)
test_class_string_parse                        21.2270 (5.52)         1.3555 (5.17)
test_class_string_parse_compiled               18.9030 (4.91)         1.6023 (6.11)
test_class_cstring_parse                       10.9060 (2.84)         1.0971 (4.19)
test_class_cstring_parse_compiled               9.4050 (2.45)         1.6083 (6.14)
test_class_pascalstring_parse                   7.9290 (2.06)         0.4959 (1.89)
test_class_pascalstring_parse_compiled          6.6670 (1.73)         0.6601 (2.52)
test_class_struct_parse                        43.5890 (11.33)        4.4993 (17.17)
test_class_struct_parse_compiled               18.7370 (4.87)         2.0198 (7.71)
test_class_sequence_parse                      20.7810 (5.40)         2.6298 (10.04)
test_class_sequence_parse_compiled             11.9820 (3.12)         3.2669 (12.47)
test_class_union_parse                         91.0570 (23.68)       10.2126 (38.97)
test_class_union_parse_compiled                31.9240 (8.30)         3.5955 (13.72)
test_overall_parse                          3,200.7850 (832.23)     224.9197 (858.34)
test_overall_parse_compiled                 2,229.9610 (579.81)     118.2029 (451.09)
--------------------------------------------------------------------------------------

Motivation, part 2

The second part of optimisation is just running the generated code on pypy. Since pypy is not using any type annotations, there is nothing to discuss in this chapter. The benchmark reflects the same code as in previous chapter, but ran on Pypy 2.7 rather than CPython 3.6.

Empirical evidence

Notice that results are in nanoseconds (10**-9).

------------------------------------- benchmark: 152 tests ------------------------------------
Name (time in ns)                                      Min                     StdDev
-----------------------------------------------------------------------------------------------
test_class_array_parse                         11,042.9974 (103.52)       40,792.8559 (46.97)
test_class_array_parse_compiled                 9,088.0058 (85.20)        43,001.3909 (49.52)
test_class_greedyrange_parse                   14,402.0014 (135.01)       49,834.2047 (57.38)
test_class_greedyrange_parse_compiled           9,801.0059 (91.88)        39,296.4529 (45.25)
test_class_repeatuntil_parse                      318.4996 (2.99)          2,469.5524 (2.84)
test_class_repeatuntil_parse_compiled             309.3746 (2.90)        103,425.2134 (119.09)
test_class_string_parse                           966.8991 (9.06)        537,241.0095 (618.62)
test_class_string_parse_compiled                  726.6994 (6.81)          3,719.2657 (4.28)
test_class_cstring_parse                          782.2993 (7.33)          4,111.8970 (4.73)
test_class_cstring_parse_compiled                 591.1992 (5.54)        479,164.9746 (551.75)
test_class_pascalstring_parse                     465.0911 (4.36)          4,262.4397 (4.91)
test_class_pascalstring_parse_compiled            298.4118 (2.80)        122,279.2150 (140.80)
test_class_struct_parse                         2,633.9985 (24.69)        14,654.3095 (16.87)
test_class_struct_parse_compiled                  949.7991 (8.90)          4,228.2890 (4.87)
test_class_sequence_parse                       1,310.6008 (12.29)         5,811.8046 (6.69)
test_class_sequence_parse_compiled                732.2000 (6.86)          4,703.9483 (5.42)
test_class_union_parse                          5,619.9933 (52.69)        30,590.0630 (35.22)
test_class_union_parse_compiled                 2,699.9987 (25.31)        15,888.8206 (18.30)
test_overall_parse                          1,332,581.9891 (>1000.0)   2,274,995.4192 (>1000.0)
test_overall_parse_compiled                   690,380.0095 (>1000.0)     602,697.9721 (694.00)
-----------------------------------------------------------------------------------------------

Motivation, part 3

Warning

Benchmarks revealed that pypy makes the code run much faster than cython, therefore cython improvements were withdrawn, and compiler now generates pure python code that is compatible with Python 2 including pypy. This chapter is no longer relevant. It remained just for educational purposes.

This chapter talks about the second half of optimisation, which is due to Cython type annotations and type inference. I should state for the record, that I am no expert at Cython, and following explanatations are merely “the way I understand it”. Please take that into account when reading it. Fourth example:

Struct(
    "num1" / Int8ul,
    "num2" / Int24ul,
    "fixedarray1" / Array(3, Int8ul),
    "name1" / CString("utf8"),
)
cdef bytes read_bytes(io, int count):
    if not count >= 0: raise StreamError
    cdef bytes data = io.read(count)
    if not len(data) == count: raise StreamError
    return data
cdef bytes parse_nullterminatedstring(io, int unitsize, bytes finalunit):
    cdef list result = []
    cdef bytes unit
    while True:
        unit = read_bytes(io, unitsize)
        if unit == finalunit:
            break
        result.append(unit)
    return b"".join(result)
def parse_struct_1(io, this):
    this = Container(_ = this)
    try:
        this['num1'] = unpack('<B', read_bytes(io, 1))[0]
        this['num2'] = int.from_bytes(read_bytes(io, 3), byteorder='little', signed=False)
        this['fixedarray1'] = ListContainer((unpack('<B', read_bytes(io, 1))[0]) for i in range(3))
        this['name1'] = (parse_nullterminatedstring(io, 1, b'\x00')).decode('utf8')
        pass
    except StopIteration:
        pass
    del this['_']
    del this['_index']
    return this
def parseall(io, this):
    return parse_struct_1(io, this)
compiled = Compiled(None, None, parseall)

The primary cause of speedup in cython is this: if a variable is of known type, then operations on that variable can skip certain checks. If a variable is a pure python object, then those checks need to be added. A variable is considered of known type if either (1) its annotated like “cdef bytes data” or (2) its inferred like when using an annotated function call result like in “parse_nullterminatedstring(…).decode(…)” since “cdef bytes parse_nullterminatedstring(…)”. If a variable is known to be a list, then calling “append” on it doesnt require checking if that object has such a method or matching signature (parameters). If a variable is known to be a bytes, then “len(data)” can be compiled into bytes-type length function, not a general-purpose length function that works on arbitrary objects, and also “unit == finalunit” can be compiled into bytes-type equality. If a variable is known to be a unicode, then “.decode(‘utf8’)” can be compiled into str-type implementation. If cython knows that “struct.unpack” returns only tuples, then “…[0]” would compile into tuple-type getitem (index access). Examples are many, but the pattern is the same: type-specific code is faster than type-general code.

Second cause of speedup is due to special handling of integers. While most annotations like “cdef bytes” refer to specific albeit Python types, the “cdef int” actually does not refer to any Python type. It represents a C-integer which is allocated on the stack or in registers, unlike the other types which are allocated on the heap. All operations on C-integers are therefore much faster than on Python-integers. In example code, this affects “count >= 0” and “len(data) == count”.

Empirical evidence

Below micro-benchmarks show the difference between core classes and cython-compiled classes. Only those where performance boost was highest are listed (although they also happen to be the most important), some other classes have little speedup, and some have none.

Notice that results are in microseconds (10**-6).

------------------------------- benchmark: 152 tests -------------------------------
Name (time in us)                                  Min              StdDev
------------------------------------------------------------------------------------
test_class_array_parse                        286.5460 (73.85)     42.8831 (89.84)
test_class_array_parse_compiled                30.7200 (7.92)       6.9577 (14.58)
test_class_greedyrange_parse                  320.9860 (82.73)     45.9480 (96.26)
test_class_greedyrange_parse_compiled         262.7010 (67.71)     36.4504 (76.36)
test_class_repeatuntil_parse                   10.1850 (2.63)       2.4147 (5.06)
test_class_repeatuntil_parse_compiled           6.8880 (1.78)       1.5471 (3.24)
test_class_string_parse                        20.4400 (5.27)       4.4044 (9.23)
test_class_string_parse_compiled                9.1470 (2.36)       2.2427 (4.70)
test_class_cstring_parse                       11.2290 (2.89)       1.6216 (3.40)
test_class_cstring_parse_compiled               5.6080 (1.45)       1.0321 (2.16)
test_class_pascalstring_parse                   7.8560 (2.02)       1.8567 (3.89)
test_class_pascalstring_parse_compiled          5.8910 (1.52)       0.9466 (1.98)
test_class_struct_parse                        44.1300 (11.37)      6.8434 (14.34)
test_class_struct_parse_compiled               16.9070 (4.36)       3.0500 (6.39)
test_class_sequence_parse                      21.5420 (5.55)       2.6852 (5.63)
test_class_sequence_parse_compiled             10.1530 (2.62)       2.1645 (4.53)
test_class_union_parse                         91.9150 (23.69)     10.7812 (22.59)
test_class_union_parse_compiled                22.5970 (5.82)      15.2649 (31.98)
test_overall_parse                          2,126.2570 (548.01)   255.0154 (534.27)
test_overall_parse_compiled                 1,124.9560 (289.94)   127.4730 (267.06)
------------------------------------------------------------------------------------

Comparison with Kaitai Struct

Kaitai Struct is a very respectable competitor, so I believe a benchmark-based comparison should be presented. Construct and Kaitai have very different capabilities: Kaitai supports about a dozen languages, Construct only supports Python, Kaitai offers only basic common features, Construct offers python-only stuff like Numpy and Pickle support, Kaitai does only parsing, Construct does also building. In a sense, those libraries are in two different categories (like sumo and karate). There are multiple scenarios where either library would not be usable.

Example used for comparison:

Struct(
    "count" / Int32ul,
    "items" / Array(this.count, Struct(
        "num1" / Int8ul,
        "num2" / Int24ul,
        "flags" / BitStruct(
            "bool1" / Flag,
            "num4" / BitsInteger(3),
            Padding(4),
        ),
        "fixedarray1" / Array(3, Int8ul),
        "name1" / CString("utf8"),
        "name2" / PascalString(Int8ul, "utf8"),
    )),
)
meta:
  id: comparison_1_kaitai
  encoding: utf-8
  endian: le
seq:
  - id: count
    type: u4
  - id: items
    repeat: expr
    repeat-expr: count
    type: item
types:
  item:
    seq:
      - id: num1
        type: u1
      - id: num2_lo
        type: u2
      - id: num2_hi
        type: u1
      - id: flags
        type: flags
      - id: fixedarray1
        repeat: expr
        repeat-expr: 3
        type: u1
      - id: name1
        type: strz
      - id: len_name2
        type: u1
      - id: name2
        type: str
        size: len_name2
    instances:
      num2:
        value: 'num2_hi << 16 | num2_lo'
    types:
      flags:
        seq:
          - id: bool1
            type: b1
          - id: num4
            type: b3
          - id: padding
            type: b4

Suprisingly, Kaitai won the benchmark! Honestly, I am shocked and dismayed that it did. The only explanation that I can point out, is that Kaitai is parsing structs into class objects (with attributes) while Construct parses into dictionaries (with keys). However that one detail seems unlikely explanation for the huge discrepancy in benchmark results. Perhaps there is a flaw in the methodology. But until that is proven, Kaitai gets its respects. Congrats.

$ python3.6 comparison_1_construct.py
Timeit measurements:
parsing:           0.1024609069 sec/call
parsing compiled:  0.0410809368 sec/call

$ pypy comparison_1_construct.py
Timeit measurements:
parsing:           0.0108308416 sec/call
parsing compiled:  0.0062594243 sec/call
$ python3.6 comparison_1_kaitai.py
Timeit measurements:
parsing:           0.0250326035 sec/call

$ pypy comparison_1_kaitai.py
Timeit measurements:
parsing:           0.0019435351 sec/call