Compilation feature

Warning

This feature is mostly implemented but still lacks Cython support, which would make it a lot faster. Consider this a work in progress.

Overall

Construct 2.9 adds an experimental feature: compiling user-made constructs into faster (but less feature-rich) code. If you are familiar with Kaitai Struct, an alternative framework to Construct, Kaitai compiles YAML-based schemas into Python modules. Construct, on the other hand, defines schemas in Python and compiles them into Python modules as well. Once you define a construct, you can use it to parse and build data without compilation. Compilation has only one purpose: performance. Compiled constructs cannot do anything more than the original constructs; in fact, they have restricted capabilities (some classes do not compile, or compile only under certain circumstances).

It should be made clear that currently the compiler supports only parsing. Building and sizeof are deferred to the original construct from which the compiled instance was made. Building support may be added in the future. In that sense, perhaps the documentation should use the term “compiled parser” rather than “compiled construct”.

Requirements

The compilation feature requires Construct 2.9 and Python 3.6. More importantly, you should manually inspect the generated code for correctness and have a test suite of your own. Construct aims to be reliable, but the compiler makes some undocumented assumptions and generates code that “takes shortcuts”.

Restrictions

Warning

These items are (probably) permanent, but feel free to request changes.

Compiled classes only parse faster; building and sizeof defer to core classes

AssertionError and standard hierarchy exceptions can be raised (core classes are restricted to ConstructError derivatives)

Lambdas (unlike this expressions) are not compilable

Sizeof is applied during compilation (not during parsing and building)

Some classes require fixed-size subcons (otherwise they raise NotImplementedError when compiled)

Some classes, like FocusedSeq and Union, require constant selectors (otherwise they raise NotImplementedError when compiled)

Exceptions do not include path information

Adapters and validators are in general not compilable

TransformBytes and Restreamed are in general not compilable (except Bitwise, Bytewise, ByteSwapped, BitsSwapped)

Checksum and LazyBound are not compilable, because they use lambdas (not this expressions)

Compiling schemas

Every construct (even those that do not compile) has a parameterless compile method that returns another construct (an instance of the Compiled class). Note that compiling takes a substantial amount of time (compared to parsing a single blob), so it may be a good idea to compile a construct that is used for processing very large data or many blobs, but it should not be overused. The compiled instance has parse and build methods just like the construct it was compiled from. You can simply reassign the compiled instance over the original.

>>> st = Struct("num" / Byte)
>>> st.parse(b"\x01")
Container(num=1)
>>> st = st.compile()
>>> st.parse(b"\x01")
Container(num=1)

The compiled instance has a unique source field that holds the generated code as a string, and also a unique tofile method that saves the generated source code to a file, for your inspection for example. You can also import such a module from a Python script.

>>> st.source
"... schema code ..."
>>> st.tofile("schema1.py")
>>> import schema1
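
The generated module exposes the compiled construct under the name compiledschema (see the last line of the generated sources shown later in this chapter), so, assuming the example above, the imported module could presumably be used like this:

>>> schema1.compiledschema.parse(b"\x01")
Container(num=1)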

Performance boost can be easily measured:

>>> print(st.benchmark(sampledata))
Timeit measurements:
compiling:         0.00048725021999416638 sec/call
parsing:           0.00002787889000046562 sec/call
parsing compiled:  0.00001943664999998873 sec/call
building:          0.00003316365799946653 sec/call
building compiled: 0.00003364123299979837 sec/call

Correctness can be tested automatically. The method checks that the provided (original) construct and its compiled equivalent both parse and build into equal objects and data. The object used for testing building is obtained by parsing the sample data.

>>> st.testcompiled(sampledata)

Motivation

The code generated by the compiler and the core classes have essentially the same functionality, but there is a noticeable difference in performance. This chapter explains the difference by comparing the Struct, FormatField, BytesInteger, and Bytes classes, including the use of a context. Example construct:

Struct(
    "num8" / Int8ub,
    "num24" / Int24ub,
    "data" / Bytes(this.num8),
)

Compiled parsing code:

def read_bytes(io, count):
    assert count >= 0
    data = io.read(count)
    assert len(data) == count
    return data
def parse_struct_1(io, this):
    this = Container(_ = this)
    try:
        this['num8'] = unpack('>B', read_bytes(io, 1))[0]
        this['num24'] = int.from_bytes(read_bytes(io, 3), byteorder='big', signed=False)
        this['data'] = read_bytes(io, this.num8)
    except StopIteration:
        pass
    del this._
    return this
def parseall(io, this):
    return parse_struct_1(io, this)
compiledschema = Compiled(None, None, parseall)

Non-compiled parsing code:

def _read_stream(stream, length):
    if length < 0:
        raise StreamError("length must be non-negative, found %s" % length)
    try:
        data = stream.read(length)
    except Exception:
        raise StreamError("stream.read() failed, requested %s bytes" % (length,))
    if len(data) != length:
        raise StreamError("could not read enough bytes, expected %d, found %d" % (length, len(data)))
    return data

class FormatField(Construct):
    def _parse(self, stream, context, path):
        data = _read_stream(stream, self.length)
        try:
            return struct.unpack(self.fmtstr, data)[0]
        except Exception:
            raise FormatFieldError("struct %r error during parsing" % self.fmtstr)

class BytesInteger(Construct):
    def _parse(self, stream, context, path):
        length = self.length(context) if callable(self.length) else self.length
        data = _read_stream(stream, length)
        if self.swapped:
            data = data[::-1]
        return bytes2integer(data, self.signed)

class Bytes(Construct):
    def _parse(self, stream, context, path):
        length = self.length(context) if callable(self.length) else self.length
        return _read_stream(stream, length)

class Renamed(Subconstruct):
    def _parse(self, stream, context, path):
        try:
            path += " -> %s" % (self.name,)
            return self.subcon._parse(stream, context, path)
        except ConstructError as e:
            if "\n" in str(e):
                raise
            raise e.__class__("%s\n    %s" % (e, path))

class Struct(Construct):
    def _parse(self, stream, context, path):
        obj = Container()
        context = Container(_ = context)
        for sc in self.subcons:
            try:
                subobj = sc._parse(stream, context, path)
                if sc.name:
                    obj[sc.name] = subobj
                    context[sc.name] = subobj
            except StopIteration:
                break
        return obj

There are several “shortcuts” that the compiled code takes:

Function calls are relatively expensive, so an inlined expression is faster than a function returning the exact same expression. Therefore FormatField compiles directly into struct.unpack(…, read_bytes(io, …)).
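
Roughly speaking, the difference can be measured with a standalone sketch like the one below (the wrapper function and the sample data are hypothetical, not part of the generated code):

import struct
import timeit

data = b"\x01"

def parse_byte(data):
    # a function that merely returns the same expression
    return struct.unpack('>B', data)[0]

# calling through the wrapper function
print(timeit.timeit(lambda: parse_byte(data)))
# evaluating the inlined expression directly, as the compiled code does
print(timeit.timeit(lambda: struct.unpack('>B', data)[0]))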

Literals like 1 and ‘>B’ are faster than object field lookups, dictionary lookups, or passed function arguments. Therefore each instance of FormatField compiles into a similar expression, but with its particular format-string and byte-count inlined, usually as literals.
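
The effect of literals can be illustrated with a similar standalone sketch (the Field class below is hypothetical and only mimics the attribute lookups that core FormatField performs):

import struct
import timeit

class Field:
    fmtstr = '>B'

field = Field()
data = b"\x01"

# format string fetched through an attribute lookup on every call
print(timeit.timeit(lambda: struct.unpack(field.fmtstr, data)[0]))
# format string inlined as a literal, as each compiled FormatField does
print(timeit.timeit(lambda: struct.unpack('>B', data)[0]))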

Passing parameters to functions is slower than just referring to variables in the same scope. Therefore, for example, compiled Struct creates a “this” variable that is accessible to all expressions generated by its subcons, because it exists in the same scope, whereas core Struct would call subcon._parse and pass the entire context as a parameter, regardless of whether that subcon even uses a context (for example FormatField and VarInt have no need for one). It is similar, but not exactly the same, with the “restream” function: the lambda in its second parameter rebinds io to a different object (a stream that gets created inside the restream function), whereas this is not rebound, as it exists in the outer scope.

An if statement (or conditional ternary operator) with two possible expressions and a condition that could be evaluated at compile time is slower than just one or the other expression. Therefore, for example, BytesInteger does a lookup to check whether the field is swapped, but compiled BytesInteger simply inlines the ‘big’ or ‘little’ literal. Moreover, Struct checks whether each subcon has a name and then inserts a value into the context dictionary, but compiled Struct simply emits an assignment or not. This shortcut also applies to most constructs that accept context lambdas as parameters: the generated code does not need to check whether a parameter is a constant or a lambda, because what gets emitted is either something like “1”, which is a literal, or something like “this.field”, which is an object lookup. Both are valid expressions and evaluate without red tape or checks.
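
The idea can be sketched in isolation (this is an illustration, not the library's actual code): the regular parser evaluates the condition on every call, while the compiled variant contains only the branch that was selected during compilation.

import timeit

swapped = False
data = b"\x00\x00\x01"

# regular style: the swapped flag is checked on every call
def parse_checked(data):
    if swapped:
        data = data[::-1]
    return int.from_bytes(data, byteorder='big', signed=False)

# compiled style: the condition was resolved at compile time, only one expression remains
def parse_resolved(data):
    return int.from_bytes(data, byteorder='big', signed=False)

print(timeit.timeit(lambda: parse_checked(data)))
print(timeit.timeit(lambda: parse_resolved(data)))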

Looping over an iterable is slower than a block of code that accesses each item once. The reason it is slower is that each iteration must fetch another item and also check the termination condition. The loop-unrolling technique requires the iterable (or rather, list) to be known at compile time, which is the case with Struct and Sequence instances. Therefore, compiled Struct emits one line per subcon, but core Struct loops over its subcons.
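
A standalone sketch of loop unrolling (the field list and format strings below are hypothetical):

import struct
import timeit
from io import BytesIO

fields = [("a", ">B", 1), ("b", ">H", 2), ("c", ">L", 4)]
data = b"\x01\x00\x02\x00\x00\x00\x03"

def parse_looped(data):
    # core style: fetch the next subcon and check termination on every pass
    io = BytesIO(data)
    obj = {}
    for name, fmt, size in fields:
        obj[name] = struct.unpack(fmt, io.read(size))[0]
    return obj

def parse_unrolled(data):
    # compiled style: one line per subcon, known at compile time
    io = BytesIO(data)
    obj = {}
    obj["a"] = struct.unpack(">B", io.read(1))[0]
    obj["b"] = struct.unpack(">H", io.read(2))[0]
    obj["c"] = struct.unpack(">L", io.read(4))[0]
    return obj

print(timeit.timeit(lambda: parse_looped(data), number=100000))
print(timeit.timeit(lambda: parse_unrolled(data), number=100000))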

Function calls that only defer to another function just waste CPU cycles. This relates specifically to the Renamed class, which in compiled code emits the same code as its subcon. The entire functionality of the Renamed class (maintaining path information) is not supported in compiled code, where it would serve as a mere subconstruct, just deferring to its subcon.

Building two nearly identical dictionaries is slower than building just one. Core Struct maintains two dictionaries (called obj and context) which differ only by the _ key, but compiled Struct maintains only one dictionary and removes the _ key before returning it.

This expressions (not lambdas) are expensive to compute in regular code, but something like “this.field” in compiled code is merely one object field lookup. The same applies to len_, obj_, and list_ expressions, since they share their implementation with the this expression.
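
For instance, the core Bytes class shown earlier evaluates its length by calling the this expression with the context; a rough comparison of that call against the plain field lookup emitted by compiled code (a sketch, assuming the usual this and Container API):

import timeit
from construct import this, Container

context = Container(num8=255)
length_expr = this.num8

# regular code: the this expression is a callable object evaluated against the context
print(timeit.timeit(lambda: length_expr(context)))
# compiled code: the same lookup is a plain field access on the local "this" variable
print(timeit.timeit(lambda: context.num8))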

Container is an implementation of a so-called AttrDict. It captures access to its attributes (field in this.field) and treats it as dictionary key access (this.field becomes this[“field”]). However, due to internal CPython drawbacks, capturing attribute access involves some red tape, unlike accessing keys, which is done directly. Therefore compiled Struct emits lines that assign to Container keys, not attributes.
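
A minimal AttrDict sketch (this is not Construct's actual Container implementation) that shows where the red tape comes from:

import timeit

class AttrDict(dict):
    # attribute access and assignment are rerouted through these hooks,
    # which costs extra work compared to plain key access
    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)
    def __setattr__(self, name, value):
        self[name] = value

this = AttrDict(field=1)

# attribute access goes through __getattr__
print(timeit.timeit(lambda: this.field))
# key access reaches the dictionary directly, which is what compiled Struct emits
print(timeit.timeit(lambda: this["field"]))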

Second example, discussing decompiled instances:

Struct(
    "field1" / Int8ub,
    "field2" / If(this.field1 == 0, Int8ub),
    "field3" / If(this.field1 == 0, RawCopy(Int8ub)),
    "field4" / RawCopy(Int8ub),
    "field5" / RawCopy(GreedyRange(Int8ub)),
)
decompiled_4 = Decompiled(lambda io,this: unpack('>B', read_bytes(io, 1))[0])
decompiled_2 = RawCopy(decompiled_4)
decompiled_5 = RawCopy(decompiled_4)
decompiled_7 = GreedyRange(decompiled_4)
decompiled_6 = RawCopy(decompiled_7)
def parse_struct_1(io, this):
    this = Container(_ = this)
    try:
        this['field1'] = unpack('>B', read_bytes(io, 1))[0]
        this['field2'] = (unpack('>B', read_bytes(io, 1))[0]) if ((this.field1 == 0)) else (None)
        this['field3'] = (decompiled_2._parse(io, this, None)) if ((this.field1 == 0)) else (None)
        this['field4'] = decompiled_5._parse(io, this, None)
        this['field5'] = decompiled_6._parse(io, this, None)
        pass
    except StopIteration:
        pass
    del this._
    return this
def parseall(io, this):
    return parse_struct_1(io, this)
compiledschema = Compiled(None, None, parseall)

Regular constructs use a different model than generated code. In regular code, every subcon is an instance of a Construct class, so to sub-parse, the outer construct calls subcon._parse(), which is a method on another instance. In generated code, a subcon’s parser is a Python expression (a one-liner) that gets embedded in the outer construct’s parser, which usually is also a Python expression. This eliminates the overhead of a function call. For example, IfThenElse and FormatField both compile into expressions, one embedded into the other.

Not all constructs have compilable parsers. Instances that can be represented by a Python expression are called “compilable”, like FormatField and Bytes. Those that can be represented by a re-created core class are called “decompilable”, like GreedyRange and RawCopy. Almost all classes are one of the two. A few classes are neither, like Compressed and Restreamed, and therefore cannot exist in compiled code. The reason for the “decompilable” classes is that they either have too much code or do too heavy work to justify writing compiled parsers for them.

If a compilable instance gets compiled (e.g. FormatField inside IfThenElse), it tries to obtain a Python expression for its subcon and embeds one expression inside another; if that fails (e.g. RawCopy inside IfThenElse), it tries to obtain a decompiled version and embeds a call to its _parse method inside the outer expression.

If a decompilable instance gets compiled (e.g. GreedyRange inside RawCopy), it tries to obtain a decompiled version of its subcon and embeds one constructor inside another; if that fails (e.g. FormatField inside GreedyRange), it tries to obtain a compiled parser (an expression), builds a Decompiled instance that is a lightweight wrapper, and embeds that instance inside the constructor.

In summary, compilable instances prefer compilable subcons, and decompilable instances prefer decompilable subcons. Bridging is possible both ways, but involves some wrappers. Even though the wrappers are lightweight, the compiler attempts to maximize efficiency. This also explains why the last line creates a Compiled instance: the module must expose a Construct instance, not an expression or a function.

Third example, discussing how the compiler uses a cache:

inner = Struct(
    "innerfield1" / Int8ub,
)
Struct(
    "field1" / inner,
    "field2" / inner,
    "field3" / RawCopy(Int8ub),
    "field4" / RawCopy(Int8ub),
)
def parse_struct_2(io, this):
    this = Container(_ = this)
    try:
        this['innerfield1'] = unpack('>B', read_bytes(io, 1))[0]
    except StopIteration:
        pass
    del this._
    return this
decompiled_5 = Decompiled(lambda io,this: unpack('>B', read_bytes(io, 1))[0])
decompiled_3 = RawCopy(decompiled_5)
decompiled_6 = RawCopy(decompiled_5)
def parse_struct_1(io, this):
    this = Container(_ = this)
    try:
        this['field1'] = parse_struct_2(io, this)
        this['field2'] = parse_struct_2(io, this)
        this['field3'] = decompiled_3._parse(io, this, None)
        this['field4'] = decompiled_6._parse(io, this, None)
    except StopIteration:
        pass
    del this._
    return this
def parseall(io, this):
    return parse_struct_1(io, this)
compiledschema = Compiled(None, None, parseall)

The compiler caches compilation results of both compilable and decompilable instances. This has the benefit of generating less code (where the same function or the same Construct instance can be used more than once), thus increasing the efficiency of the CPU cache. A compilable instance (like Struct) sometimes appends an entire function to the generated code and returns/caches that function’s name. A decompilable instance appends one line to the generated code, assigning a Construct instance to a generated name, and returns/caches that name. Simple instances like FormatField simply return/cache an expression.

The example shows that the “inner” struct is used twice, and so is “parse_struct_2”; and since Byte is a singleton, so is “decompiled_5”.

Empirical evidence

The “shortcuts” described above may not seem like much, but they amount to quite a large portion of the actual run-time; in fact, about a third (31%) of the entire run-time. Results copied from the earlier section.

>>> print(st.benchmark(sampledata))
Timeit measurements:
compiling:         0.00048725021999416638 sec/call
parsing:           0.00002787889000046562 sec/call
parsing compiled:  0.00001943664999998873 sec/call
building:          0.00003316365799946653 sec/call
building compiled: 0.00003364123299979837 sec/call