What is intermediate representation (IR)?

We can solve any problem by introducing an extra level of indirection.

– Fundamental theorem of software engineering[1]

In general, an intermediate representation (IR)[2] describes data in a common form that can be processed by a series of operations.

The term originates from compiler design. People design IRs to represent an intermediate language between the source and target languages, and perform passes/transforms on the IR to analyze or optimize the code.

An advantage of using an IR is that our passes can be modularized and decoupled, which makes them easy to manage (e.g., to resolve dependencies among passes).

An IR also prevents a combinatorial explosion when we want to support multiple source and target languages. With a monolithic architecture, given \(n\) source languages and \(m\) target languages, the “implementation complexity” could be \(\mathcal{O}(n m)\) in the worst case; with a shared IR, we only need \(n\) front ends and \(m\) back ends, i.e. \(\mathcal{O}(n + m)\).
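
To make this concrete, here is a minimal sketch of such a pass architecture (the names `Module`, `Pass`, and `PassManager` are hypothetical, loosely inspired by how production compilers organize passes):

```cpp
#include <memory>
#include <vector>

// Hypothetical shared IR type: each front end lowers its source language
// into a Module, each back end consumes a Module, so adding a language
// touches only one side of the pipeline.
struct Module { /* functions, basic blocks, instructions ... */ };

struct Pass {
    virtual ~Pass() = default;
    virtual void run(Module& m) = 0;  // analyze or transform the IR in place
};

class PassManager {
public:
    void add(std::unique_ptr<Pass> p) { passes_.push_back(std::move(p)); }
    void run(Module& m) {
        for (auto& p : passes_) p->run(m);  // passes stay decoupled
    }
private:
    std::vector<std::unique_ptr<Pass>> passes_;
};
```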

When memory becomes a major bottleneck

A question arises when you are designing an IR: how do you pass the IR from one pass to the next?

One common choice is to use main memory. But this may cause serious performance issues on modern memory architectures, given that the latency gap between the CPU and main memory keeps growing.

In a naive implementation of this architecture, each pass fetches the input IR from memory, performs its operation, writes the output back to memory, and finally hands the processed data to the next pass. The repeated store-load cycles can become a serious bottleneck.

Our goal is to make the working set small enough to fit into cache, or even to communicate with the next pass directly through variables (registers), just as in a monolithic architecture.
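
As a toy illustration (both passes are made up for the example), compare a materialized two-pass pipeline, which round-trips the intermediate buffer through memory, with a fused loop that keeps the intermediate value in a register:

```cpp
#include <cstddef>
#include <vector>

// Materialized version: each pass writes its full result back to memory,
// so the intermediate buffer makes a round trip through main memory.
std::vector<float> pass_a(const std::vector<float>& in) {
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = in[i] * 2.0f + 1.0f;
    return out;
}
std::vector<float> pass_b(const std::vector<float>& in) {
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = in[i] * in[i];
    return out;
}

// Fused version: the two per-element operations are combined into one loop,
// so the intermediate value lives in a register and never touches memory.
void fused(const std::vector<float>& in, std::vector<float>& out) {
    out.resize(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        float t = in[i] * 2.0f + 1.0f;  // pass A's work
        out[i]  = t * t;                // pass B's work; t stays in a register
    }
}
```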

But in practice this is not always easy. We will see several (negative and positive) examples in the next section, and learn how this problem is solved in different situations.

Case studies

Pass architecture in compiler design

I believe the IR-and-pass architecture is the standard way to manage transform passes in a compiler.

One common task in compiler optimization is to traverse a graph or tree for pattern matching and then modify it. The pattern matching usually requires information from several layers of the graph/tree, or even the whole data structure. This access pattern exposes almost no locality, so it is hard to keep a small working set, and it is common to pass the whole data structure between passes through main memory.
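
For instance, a minimal constant-folding pass over a hypothetical expression tree (`Expr` and `fold` are invented for illustration) has to chase child pointers scattered across the heap before it can rewrite anything:

```cpp
#include <memory>

// Hypothetical expression tree. A folding pass must look at a node *and*
// its children before rewriting -- the low-locality access pattern above.
struct Expr {
    enum Kind { Const, Add };
    Kind kind = Const;
    int value = 0;                   // used when kind == Const
    std::unique_ptr<Expr> lhs, rhs;  // used when kind == Add
};

// Rewrite `Const + Const` subtrees into a single Const node, bottom-up.
void fold(std::unique_ptr<Expr>& e) {
    if (!e || e->kind != Expr::Add) return;
    fold(e->lhs);
    fold(e->rhs);
    if (e->lhs && e->rhs &&
        e->lhs->kind == Expr::Const && e->rhs->kind == Expr::Const) {
        auto c = std::make_unique<Expr>();
        c->value = e->lhs->value + e->rhs->value;  // kind defaults to Const
        e = std::move(c);
    }
}
```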

Compared with the time complexity of the algorithms used in compiler construction, this memory overhead is negligible in most cases.

Blender Modifier architecture

Blender is an open-source 3D software suite. It implements a Modifier system to pipeline modifications on a geometry. I have several introductory posts (Blender modifier system part 1[3], part 2[4], part 3[5]) on this topic.

In general, the geometric data is organized in a stack and passed through a series of modifiers for deformation, subdivision, and other geometric processing.
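
A rough sketch of the idea (the types below are hypothetical stand-ins, not Blender's actual API):

```cpp
#include <memory>
#include <vector>

// Hypothetical stand-ins for Blender's mesh and modifier types.
struct Mesh { /* vertices, edges, faces ... */ };

struct Modifier {
    virtual ~Modifier() = default;
    virtual Mesh apply(const Mesh& in) = 0;  // e.g. deform, subdivide
};

// The modifier stack: each result is materialized and handed to the next
// modifier, so every stage round-trips the whole mesh through memory.
Mesh evaluate(const Mesh& base,
              const std::vector<std::unique_ptr<Modifier>>& stack) {
    Mesh current = base;
    for (const auto& m : stack) current = m->apply(current);
    return current;
}
```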

In the current implementation, all of this data is passed through main memory. The mesh data is usually far too large to fit in the cache, so repeated memory stores and loads can incur significant overhead.

However, geometric processing is a CPU-bound task dominated by heavy floating-point operations, so memory performance does not become the bottleneck.

And in practice, an artist rarely stacks many modifiers, so the overhead does not accumulate enough to hurt performance.

Rendering pipeline and shading language

In real-time rendering, the whole process is arranged as a pipeline. Geometric data flows through the pipeline with shaders applied at each stage.

For vertex processing, we provide vertex shaders to the pipeline. Each shader program has several inputs and outputs, and many shader programs (the actual number depends on the shader's complexity and the artist's workflow) are linked together through these input and output ports.

Performance would suffer if these inputs and outputs communicated through memory (GPU main memory).

In practice, a shader compiler reads the shader source code together with the input/output port information and decides the final memory layout. Some data can even flow through the pipeline without ever spilling to memory.
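
To illustrate (this is a toy allocator, not how any real shader compiler works): given the varyings linking two stages, the compiler can pack each one into a register slot, so values are handed from stage to stage without a memory round trip.

```cpp
#include <map>
#include <string>
#include <vector>

// Toy model: each varying linking two shader stages has a name and a size
// in 32-bit components (e.g. a vec3 normal occupies 3 components).
struct Varying { std::string name; int components; };

// Assign each varying a starting register component. As long as the total
// fits the hardware's register budget, the data never spills to memory.
std::map<std::string, int> layout(const std::vector<Varying>& vars) {
    std::map<std::string, int> slot;
    int next = 0;
    for (const auto& v : vars) { slot[v.name] = next; next += v.components; }
    return slot;  // e.g. {"normal": 0, "uv": 3} for a vec3 followed by a vec2
}
```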

RapidJSON and concept-based design

RapidJSON[6] is a C++ JSON parsing/serialization library that aims for optimal performance.

In RapidJSON, transcoding is modeled in a way that fits our data-flow/IR framework: JSON data in a source encoding is read in as input, processed by a transcoder, and finally written out in a target encoding.

In a trivial implementation, the whole JSON document is read into intermediate storage, which can be seen as a form of IR; the transcoder then loads this data and performs the transcoding.

Obviously these store and load operations can be eliminated. RapidJSON models the data flow as a stream: all reads and writes occur directly through the stream without redundant memory accesses.

Moreover, RapidJSON parameterizes the behavior of its streams and transcoders through static binding (C++ template parameters, or “concepts” in modern C++ terminology). This lets the compiler inline the code aggressively, achieving performance comparable to a monolithic implementation while preserving flexibility and extensibility.
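
Here is a minimal sketch of the idea, simplified from RapidJSON's stream concept (real RapidJSON streams expose `Peek()`, `Take()`, and `Put()`; the streams and the pass-through “transcoding” below are toy stand-ins):

```cpp
#include <string>

// Toy streams modeled after RapidJSON's stream concept: a read stream
// exposes Peek()/Take(), a write stream exposes Put(). No virtual calls.
struct StringInputStream {
    const char* src;
    char Peek() const { return *src; }
    char Take()       { return *src++; }
};

struct StringOutputStream {
    std::string buf;
    void Put(char c) { buf.push_back(c); }
};

// The transcoder is parameterized by the stream types (static binding),
// so the whole loop can be inlined: each character flows from input to
// output through registers, never through an intermediate IR buffer.
template <typename InputStream, typename OutputStream>
void Transcode(InputStream& is, OutputStream& os) {
    while (is.Peek() != '\0') os.Put(is.Take());  // real code would re-encode here
}

int main() {
    StringInputStream  is{"{\"hello\":\"world\"}"};
    StringOutputStream os;
    Transcode(is, os);  // inlined end to end; no intermediate storage
}
```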

Pay attention to the memory access pattern here: locality is the key reason why this design works.

Milo Yip (the author of RapidJSON) has a detailed explanation (in Chinese)[7] of this part of the code, which is worth a read.

Summary

Memory performance may not be a bottleneck if there are not too many store-load cycles, or if most passes are heavily CPU-bound.

Otherwise, the stream concept can be used if we can identify locality in the access pattern. If the target platform has a complex memory hierarchy, designing a DSL and implementing a compiler can help decide the memory layout.

Finally, little can be done if the application exhibits almost no locality.

References