A shader compiler for VirtualDub

For a while now I've been wishing that I had some nicer way to write video filters. I've been working on VirtualDub long enough that the actions necessary to create a new video filter, write the callbacks, write the pixel processing routines, etc. are mostly second nature. However, it's still a lot of work to toss out a quick video filter. An example was a person who emailed me a while back asking for a YCbCr-to-RGB video filter, because he had the interesting situation of trying to record YPbPr component outputs connected to the RGB component inputs of a video capture card. In the end, it took me a couple of hours to bang out a test filter that ideally should have only taken a few minutes. It would have been nice to come up with a language to write quick filters in, even though it might be slower than straight C++ code:

Well, a couple of weeks ago I sat down and starting banging out such a system. Presenting VDShader:

http://www.virtualdub.org/downloads/vdshader-1.0.zip

http://www.virtualdub.org/downloads/vdshader-1.0-src.zip

You need VirtualDub to process video with it (I tested it with 1.6.19 and 1.7.2), but if you just want to play with the editor and compiler, you can run the shader editor as follows:

rundll32 vdshader.vdf,OpenEditor

The VDShader video filter is probably the most complex video filter for VirtualDub that I've ever written, because it consists of:

A shader compiler.

A shader bytecode interpreter.

A shader bytecode x86 JIT engine.

A text editor.

There are still a lot of things missing from the filter, particularly a lot of primitives in the language, but to give you some idea of what a shader looks like, here's the convert-to-grayscale example shader:

sampler src; float4 PS(float2 uv: TEXCOORD0) : COLOR0 { float4 px; return dot(tex2D(src, uv).rgb, float3(0.30, 0.59, 0.11)); }

technique { pass { PixelShader = compile ps_3_0 PS(); } }

The shader language parsed by the filter is similar to a subset of the kinds of vector processing languages processed by NVIDIA's Cg compiler and Microsoft's High Level Shader Language (HLSL). I chose a similar syntax because I'm familiar with HLSL and this style of syntax is the most fluid syntax I've seen so far for working with colors and vectors. The idea is that if someone wants to try a really quick image processing operation, like scaling green by 80%, it can be expressed in this filter by shader code as simple as (pixel.g * 0.8).

(Details about the internals after the click.)

I had made a previous attempt at something like this with my earlier "Filter Factory" filter, but that had several shortcomings. The main one is that it worked in fixed point, which made it hard to use. Second, it had non-standard syntax. Third, it only had an interpreter, so it was slow. Fourth, I never really documented the language. The killer, though, was that the compiler had a ton of bugs in it, so it never really worked right. Part of the reason the compiler was so bad was that I'd written Filter Factory a long time ago, and had less experience in writing compilers, or even parsers/lexers. Another reason, though, is that Filter Factory built a tree from the code and then compiled that, which meant rather complex control flow during the compilation process, and meant that I spent a lot more time whacking tree nodes instead of actually generating code.

The parser and lexer weren't very hard to write, since I've written enough of them by now. If you're just starting out, you might be tempted to try lex/yacc (flex/bison for free software). They're good to bang something out quickly, but if you're trying to learn, I'd recommend ditching them and writing the lexer and parser yourself. Most programming languages with C-like syntax don't need a full regular expression based lexer or LALR(1) table-based parser -- they can be parsed fine with a simple scanner and a hand-written parser. I've found that it's a lot easier to deal with attribute passing and error handling with a hand-written recursive descent parser than to try to push everything in and out of a yacc-based state machine parser. I'm pretty sure that Visual C++ uses a parser of this type, and I think GCC also switched from a generated parser to a hand-written parser (which probably explains its lowered tendency to just report "parse error").

What I typically use now is a hybrid parser, with recursive descent for structures and operator-precedence for expressions. Operator-precedence avoids having to write zillions of routines for every precedence level, like ParseAdditiveExpression() and ParseMultiplicativeExpression(). (I think Avisynth's expression parser works this way.) I've also evolved a fashion of error handling which doesn't require exceptions, which is to have every parse routine return a bool and propagate up false return values, have the set error routine always return false, and just return the result of the error routine:

if (!ParseExpression(&value)) return false; ... return lexer.SetErrorF("Cannot convert from type '%s' to type '%s'", GetTypeName(vA.mType), GetTypeName(vB.mType));

This is nice for when you're in an environment that doesn't have C++ exception support.

The compiler, this time, is structured as a bottom-up compiler, meaning that it generates shader instructions on the fly as it parses the source. Variables get allocated registers on the spot, and operations on them immediately emit arithmetic instructions. This is a simpler type of compiler to write, but not building an abstract syntax tree (AST) makes some types of high-level transformations difficult or impossible. For instance, algebraic identities are difficult to apply to the assembly output, since complex operations like matrix multiplication are already exploded in the parser. On the other hand, the optimizer turned out to perform better than I had expected, especially given that it only does copy propagation and dead store elimination. Copy propagation detects when copies of values are used and replaces references to the copy with the original, so that given this set of instructions:

mov r0, c0 add r1, r1, r0

...the compiler detects that r0 (temp 0) is a copy of c0 (constant 0), and changes the ADD instruction to use c0 directly:

mov r0, c0 add r1, r1, c0

The dead store elimination (a form of dead code elimination) pass then detects that r0 is never used, and deletes the MOV instruction, producing a better result:

add r1, r1, c0

A factor that made these optimizations easier to implement is that the bottom-up compiler naturally generates code in single static assignment (SSA) form. The old Filter Factory compiler reused registers, which made it really difficult to apply such optimizations because full register life information would have been needed. With SSA, there's only one instruction that writes to a virtual register, meaning that you only need to check if that register is used at all and then kill the instruction that writes it if not. It also helps that I don't support looping or branching, although branching is the easier of the two (it can be handled with a merge instruction, which in the case of a pixel shader would be something like cmp or cnd).

The RGB to YUV shader thus produces the following assembly:

def c0, 0.5, 0.299, -0.1687, 0 def c1, -0.4187, 0.587, -0.3313, 0 def c2, -0.0813, 0.114, 0.5, 0 def c3, 0.501961, 0, 0.501961, 0 texld r1, r0, s0 mul r2.xyz, r1.x, c0 mad r3.xyz, r1.y, c1, r2 mad r4.xyz, r1.z, c2, r3 add r5.xyz, r4, c3 mov r6.xyz, r5 mov r6.w, r1.w mov o0, r6

This isn't too bad, especially considering that the unoptimized assembly is usually horrible, with as much as ~80% being useless register-to-register moves, and gets really bad whenever the compiler inlines a function call. For comparison, this is what the Microsoft HLSL compiler generates from the same source (fxc /Tfx_2_0 rgb2yuv.fx):

ps_3_0 def c0, 0.5, -0.41870001, -0.0812999979, 0 def c1, 0.298999995, 0.587000012, 0.114, 0 def c2, -0.168699995, -0.33129999, 0.5, 0 def c3, 0.501960814, 0, 0, 0 dcl_texcoord v0.xy dcl_2d s0 texld r0, v0, s0 dp3 r1.x, r0, c0 dp3 r1.y, r0, c1 dp3 r1.z, r0, c2 mov oC0.w, r0.w add oC0.xyz, r1, c3.xyxw

The main problem I've noticed is that my compiler is unable to deal with splitting or merging instructions as necessary to eliminate redundant mov instructions. You might also have noticed that I haven't bothered yet to make the output of my compiler actually valid D3D assembly, since I don't actually ever try running the shader on hardware (there isn't a vertex shader anyway).

Following this, the code goes through a vector-to-scalar recompiler, which takes all of the vector instructions -- which process as many as four values in parallel -- and recompiles them to individual scalar instructions which only handle one component at a time. The result then goes through similar copy propagation and dead store elimination routines. This catches a lot of the redundancy that the vector compiler is unable to eliminate. In the case of the above, this is the result:

def c0, 0.5 def c1, 0.299 def c2, -0.1687 def c3, -0.4187 def c4, 0.587 def c5, -0.3313 def c6, -0.0813 def c7, 0.114 def c8, 0.5 def c9, 0.501961 def c10, 0 def c11, 0.501961 def c12, 0 def c13, 0 def c14, 0 def c15, 0 texld r0, c12, c13 unpackr r1, r0 unpackg r2, r0 unpackb r3, r0 unpacka r0, r0 mul r4, r1, c0 mul r5, r1, c1 mul r1, r1, c2 mul r6, r2, c3 add r4, r6, r4 mul r6, r2, c4 add r5, r6, r5 mul r2, r2, c5 add r1, r2, r1 mul r2, r3, c6 add r4, r2, r4 mul r2, r3, c7 add r5, r2, r5 mul r3, r3, c8 add r1, r3, r1 add r4, r4, c9 add r5, r5, c10 add r1, r1, c11 end c0, c0, c0

There do tend to be a lot of useless constants in the result, since they don't really hurt (I have no limit, unlike real hardware). One big optimization that is missing is constant folding, which would compile out useless operations on only constants, like ADD r0, c0, c1. In vector languages, it's common to do operations like (vec * float3(0.5, 1.0, 0.4)), because a vector operation is the same cost as a scalar operation and there's no cost for the useless multiply of green by 1.0. When recompiling to scalar code, however, the compiler could recognize that one of the factors is 1.0 and replace the multiply with a move, which would then get dropped out by copy propagation.

Following this, the x86 JIT compiler kicks in to compile the shader to machine code. I'll freely admit that my JIT compiler sucks, because it doesn't even try to do register caching -- if you look at the disassembly that appears, you'll see a ton of useless loads and stores. I won't even bother pasting it here because it's too long. However, it turns out that's not the biggest problem. JITting turned out to help a lot less than I'd expected because the biggest bottleneck right now isn't the shader code itself, but the texture sampling, because right now it's a brute force bilinear sampling routine. Because the shader uses x87 floating point, I can't use MMX in that routine and thus it's quite slow at the moment. I'm probably going to try writing an SSE2 version, but the main thing I need to do is finish the sampler support in the compiler so that you can switch the filtering mode to POINT, which would eliminate the need for the bilinear sampling entirely in many shaders.

The last bit of code that I'd decided to write from scratch is the text editor. Truth be told, I hadn't expected or wanted to write a custom text editor for this, but I ended up doing so because I had grown tired of repeatedly having to work around shortcomings and bugs in the Microsoft Rich Edit control, such as:

You can't append text to the control using window messages without scrolling it, since you have to move the cursor to do it.

The Text Object Model (TOM) interfaces almost let you insert without scrolling, except that it does so anyway if the insertion position is visible and the window is scrolled to a partial line. Oh, and it sometimes ends up with the caret displayed on the wrong line afterward.

You can get and set the scrolling position at a pixel resolution, but it always aligns the window vertically to a whole line when you set it, so you can't perfectly restore it.

You can tell the rich edit control to use either plain text or rich text, but you can't tell it to display rich text while only allowing entry of plain text. So much for syntax highlighting.

You can't set the tab size. You only set tab stops on existing text, but only in twips (twentieths of an inch), not in monospace characters. Oh, and you can't set the tab size after the fixed tab stops, or the tab stops for new text. Did I mention that you can't set the tab size???

Searching up is just broken -- I was able to get searching down to wrap with extra logic, but couldn't ever get RichEdit to search up properly from the bottom with the cursor at the top. The docs are FUBARed too, with EM_FINDTEXTEX saying that the search always happens from the current selection, but the FINDTEXTEX structure contains a "chrg" field that "specifies the range of characters to search." I couldn't figure out how this field actually works, because it seems that sometimes the control uses it and sometimes it doesn't.

After I'd had enough of this I decided to rip RichEdit out and bang out a new text editor over the weekend. It's not the greatest editor in the world, particularly since it doesn't support Undo/Redo right now, but at least syntax highlighting was a lot easier to implement and didn't require stupid workarounds. It also only works on 8-bit text right now, so it won't pass Michael Kaplan's "buttload of diacritics" test, but given that this is a source code editor I didn't worry about localization. Besides, converting it to Unicode at least wouldn't be too bad.

One of the toughest features to work out turned out to be caret positioning in word wrap mode -- I'd originally written it using iterators stored as (paragraph, char offset), but discovered that it led to the undesirable behavior that you couldn't place the caret at the end of a wrapped line, because that was the same character position as the start of the next. I ended up rewriting the editor to use (paragraph, line, char offset) instead, which turned out to work better and was actually simpler in some ways. I'm also proud of the fact that the editor doesn't use any fixed size line buffers anywhere, so it doesn't have any line length limitations, like the 2048 character limitation in Visual C++ 6.0.

Anyway, if you run into bugs in the filter, feel free to drop me a line. This has been somewhat of a research project, so I don't know how much more time I'll be putting into it, but I figure it'd at least get it to a useful enough point that simple filters could be written without the need for a real compiler.