Optimizing Lever PNG loading

The following code took over 9.4 seconds to run:

import png

for i in range(10)
    image = png.read_file(dir ++ "frants_boe_villender_test.png")

The image to load was a 640x480 RGBA image from a painting by Frants Diderik Bøe, who specialized in still lifes and landscapes.

The program became slow when I added the scanline filtering, so I have a guess about what slows it down. But I am really not in the mood for guessing wrong, so let's run the code through a profiler like this:

import png, fs, vmprof

vmprof_fd = fs.open(dir ++ "png.vmprof", fs.WRONLY | fs.TRUNC | fs.CREAT)
vmprof.enable(vmprof_fd, 0.0001)
for i in range(10)
    image = png.read_file(dir ++ "frants_boe_villender_test.png")
vmprof.disable()
vmprof_fd.close()

The profiler output:

100.0% L3_16 100.0% png_load.lc:3 (line 5 at this post)
100.0% .. L5_39 100.0% png.lc:5
 99.6% .... L53_82 99.6% png.lc:53
  0.1% ...... L87_90 0.1% png.lc:87
  0.1% ...... L90_96 0.1% png.lc:90
 49.4% ...... L99_101 49.6% png.lc:99
 35.4% ........ L104_117 71.6% png.lc:104
 16.9% .......... L117_131 47.9% png.lc:117
  0.6% ...... L96_99 0.6% png.lc:96
  0.3% .... L142_159 0.3% png.lc:142
  0.1% ...... L159_161 37.5% png.lc:159

At png.lc:53 we have the code that runs the decoding loop:

read = (self, data):
    data = self.z.decompress(data)
    for byte in data
        if self.new_scanline
            self.filter = decode_filters[byte]
            self.new_scanline = false
            continue
        if self.l == 0
            u = 0
            c = 0
            if self.x - self.stride >= self.l
                l = self.data[self.x - self.stride]
            else
                l = 0
        else
            u = self.data[self.x - self.y_stride]
            if self.x - self.stride >= self.l
                l = self.data[self.x - self.stride]
                c = self.data[self.x - self.y_stride - self.stride]
            else
                l = 0
                c = 0
        self.data[self.x] = self.filter(byte, l, u, c)
        self.x += 1
        if self.x >= self.r
            self.l += self.y_stride
            self.r += self.y_stride
            self.new_scanline = true

It tells us that almost half of the time is spent at png.lc:99.

decode_filters = {
    4: (byte, l, u, c):
        return byte + paeth_predictor(l, u, c)
}

paeth_predictor = (a, b, c):
    p = a + b - c
    pa = abs_int(p - a)
    pb = abs_int(p - b)
    pc = abs_int(p - c)
    if pa <= pb and pa <= pc
        return a
    elif pb <= pc
        return b
    else
        return c

# TODO: make them into multimethods
abs_int = (a):
    if a < 0
        return -a
    return a

Line 60 is the paeth_predictor, and line 73 is the abs_int.
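To make the algorithm concrete, here is a sketch of the same Paeth predictor in plain Python. The names mirror the Lever code above, but this is an illustration, not the runtime implementation; the & 0xFF wrap is implicit in the Lever version because the target buffer holds bytes.

```python
def paeth_predictor(a, b, c):
    # a = left, b = up, c = upper-left; pick the neighbour
    # closest to the initial estimate p = a + b - c.
    p = a + b - c
    pa = abs(p - a)
    pb = abs(p - b)
    pc = abs(p - c)
    if pa <= pb and pa <= pc:
        return a
    elif pb <= pc:
        return b
    else:
        return c

def paeth_filter(byte, l, u, c):
    # decode_filters[4] in the Lever code: add the prediction
    # back to the filtered byte, modulo 256 as PNG requires.
    return (byte + paeth_predictor(l, u, c)) & 0xFF
```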

Motive

There are many C libraries for loading PNG images that I could have integrated into my runtime.

Lever is a gradually designed programming language. Observations made about the behavior of the language are used to improve it further. The intention is to leave in the possibility of changing the language, so that it can be refined over time to meet newly made discoveries.

One of my objectives with Lever is to make it a high-level programming language that obtains good runtime performance on its own, and then to supply it with utilities for optimizing implementations to match the performance obtainable with the C language.

To get anywhere with this, we need to start with something simple, and preferably something practical. Therefore I decided to write my own PNG implementation in Lever.

First the abs multimethod

The abs_int is a crutch that I made because Lever's abs function only operates on floats. The limitation isn't very elegant, and I've intended to remove it once it becomes relevant.

When you're optimizing code, it is preferable to start with a version that doesn't have any crutches originating from unimplemented features in the language's runtime, so the first obvious step here is to remove abs_int and replace it with a properly implemented abs multimethod.

The implementation of abs is in runtime/vectormath.py, as pointed out by the documentation for abs(). We replace it with the multimethod implementation:

abs_ = Multimethod(1)

@abs_.multimethod_s(Float)
def abs_float(f):
    return Float(-f.number) if f.number < 0.0 else f

@abs_.multimethod_s(Integer)
def abs_int(i):
    return Integer(-i.value) if i.value < 0 else i

The Lever runtime needs to be retranslated for the changes to take effect. We can then remove abs_int and use abs in the paeth_predictor.

Meanwhile we have time to think about the paeth predictor. When you look at the variable flow, every math operation appears to require the previous result. This is an important observation because we are running on a superscalar processor: a long dependency chain prevents the CPU from executing the operations in parallel.

Although we are still somewhere on the dark side of the moon when it comes to instruction-level optimizations. When I run the program again, it now takes 8 seconds.

100.0% L3_16 100.0% png_load.lc:3 (line 5)
100.0% .. L5_39 100.0% png.lc:5
 99.4% .... L53_82 99.4% png.lc:53 (line 27)
  0.9% ...... L96_99 0.9% png.lc:96
  0.2% ...... L90_96 0.2% png.lc:90
 32.4% ...... L99_101 32.6% png.lc:99 (line 56)
 15.3% ........ L104_125 47.1% png.lc:104 (line 60)
  0.1% ...... L87_90 0.1% png.lc:87
  0.4% .... L136_153 0.4% png.lc:136
  0.3% ...... L153_155 62.5% png.lc:153
  0.1% .... L125_136 0.1% png.lc:125
  0.1% ...... L153_155 100.0% png.lc:153

So in total we shaved about 15% with that small change. But it is still the paeth predictor and the PNG decoding loop that slow us down.

Checking the JIT traces

The JIT trace might show us something interesting. The following command extracts it for us and places it into pnglog:

PYPYLOG=jit-log-opt:pnglog lever samples/png/png_load.lc

This is the first time I am using the JIT trace to optimize a program, so my function for the printable location only returned the program counter:

debug_merge_point(0, 0, 'pc=25 setloc module=<module png>')
debug_merge_point(0, 0, 'pc=29 getglob module=<module png>')
debug_merge_point(0, 0, 'pc=32 getloc module=<module png>')
debug_merge_point(0, 0, 'pc=35 getloc module=<module png>')
debug_merge_point(0, 0, 'pc=38 getglob module=<module png>')
debug_merge_point(0, 0, 'pc=41 call module=<module png>')
+1119: i119 = int_sub(i118, i110)

I adjusted it to provide line numbers too. I'm sure it will need something better later on:

debug_merge_point(0, 0, 'pc=25 setloc module=<module png>:105')
debug_merge_point(0, 0, 'pc=29 getglob module=<module png>:105')
debug_merge_point(0, 0, 'pc=32 getloc module=<module png>:106')
debug_merge_point(0, 0, 'pc=35 getloc module=<module png>:106')
debug_merge_point(0, 0, 'pc=38 getglob module=<module png>:106')
debug_merge_point(0, 0, 'pc=41 call module=<module png>:106')
+1119: i119 = int_sub(i118, i110)

After filtering out lots of debug_merge_points, I see that the paeth predictor ends up like this:

+916: guard_nonnull_class(p3, ConstClass(Integer), descr=<Guard0x7effcc4a9db0>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104] +941: guard_nonnull_class(p5, ConstClass(Integer), descr=<Guard0x7effcc4a9e08>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104] +966: guard_not_invalidated(descr=<Guard0x7effcc475e20>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104] +980: p108 = getfield_gc_r(ConstPtr(ptr107), descr=<FieldP space.multimethod.Multimethod.inst_version 48>) +991: guard_value(p108, ConstPtr(ptr109), descr=<Guard0x7effcc475e60>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104] +1000: i110 = getfield_gc_i(p3, descr=<FieldS space.numbers.Integer.inst_value 8 pure>) +1011: i111 = getfield_gc_i(p5, descr=<FieldS space.numbers.Integer.inst_value 8 pure>) +1022: i112 = int_add(i110, i111) +1032: guard_nonnull_class(p7, ConstClass(Integer), descr=<Guard0x7effcc4a9e60>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104] +1071: p115 = 
getfield_gc_r(ConstPtr(ptr114), descr=<FieldP space.multimethod.Multimethod.inst_version 48>) +1089: guard_value(p115, ConstPtr(ptr116), descr=<Guard0x7effcc475ea0>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104] +1098: i117 = getfield_gc_i(p7, descr=<FieldS space.numbers.Integer.inst_value 8 pure>) +1109: i118 = int_sub(i112, i117) +1119: i119 = int_sub(i118, i110) +1147: p121 = getfield_gc_r(ConstPtr(ptr120), descr=<FieldP space.multimethod.Multimethod.inst_version 48>) +1165: guard_value(p121, ConstPtr(ptr122), descr=<Guard0x7effcc475ee0>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104] +1174: i124 = int_lt(i119, 0) +1182: guard_false(i124, descr=<Guard0x7effcc475f20>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104] +1188: i125 = int_sub(i118, i111) +1198: i127 = int_lt(i125, 0) +1202: guard_false(i127, descr=<Guard0x7effcc475f60>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104] +1208: i128 = int_sub(i118, i117) +1218: i130 = int_lt(i128, 0) +1222: guard_false(i130, descr=<Guard0x7effcc475fa0>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, 
p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104] +1235: p132 = getfield_gc_r(ConstPtr(ptr131), descr=<FieldP space.multimethod.Multimethod.inst_version 48>) +1246: guard_value(p132, ConstPtr(ptr133), descr=<Guard0x7effcc4c0020>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104] +1255: i134 = int_le(i119, i125) +1262: guard_true(i134, descr=<Guard0x7effcc4c0060>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104] +1268: i135 = int_le(i119, i128) +1275: guard_true(i135, descr=<Guard0x7effcc4c00a0>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104] +1281: leave_portal_frame(0)

Superfluous guards

I even found something that should no longer be there:

+980: p108 = getfield_gc_r(ConstPtr(ptr107), descr=<FieldP space.multimethod.Multimethod.inst_version 48>) +991: guard_value(p108, ConstPtr(ptr109), descr=<Guard0x7effcc475e60>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104] +1071: p115 = getfield_gc_r(ConstPtr(ptr114), descr=<FieldP space.multimethod.Multimethod.inst_version 48>) +1089: guard_value(p115, ConstPtr(ptr116), descr=<Guard0x7effcc475ea0>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104] +1147: p121 = getfield_gc_r(ConstPtr(ptr120), descr=<FieldP space.multimethod.Multimethod.inst_version 48>) +1165: guard_value(p121, ConstPtr(ptr122), descr=<Guard0x7effcc475ee0>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104] +1235: p132 = getfield_gc_r(ConstPtr(ptr131), descr=<FieldP space.multimethod.Multimethod.inst_version 48>) +1246: guard_value(p132, ConstPtr(ptr133), descr=<Guard0x7effcc4c0020>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]

Every guard in the JIT trace converts into a comparison and a conditional branch. The guards ensure that the program diverges from the fast path when the trace is no longer applicable. Making the version number pseudo-immutable should help a bit here:

class Multimethod(Object):
    _immutable_fields_ = ['arity', 'multimethod_table',
        'interface_table', 'default?', 'version?']

A field is marked pseudo-immutable when we expect that its value does not change often. The version number changes when the multimethod is updated, but from the perspective of hot loops it never changes.
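The idea behind the version field can be sketched in plain Python: lookups are keyed on a version object, and updating the multimethod replaces that object, invalidating every cached lookup at once. The names here are illustrative, not the runtime's actual API:

```python
class VersionTag:
    """An opaque token; identity is the version."""

class Multimethod:
    def __init__(self):
        self.table = {}
        self.version = VersionTag()  # the pseudo-immutable field

    def register(self, types, fn):
        self.table[types] = fn
        self.version = VersionTag()  # bump: old cached lookups die

    def lookup(self, types):
        # A JIT can constant-fold this lookup for as long as its
        # guard on `self.version` keeps holding.
        return self.table[types]

mm = Multimethod()
mm.register((int,), lambda x: -x if x < 0 else x)
v1 = mm.version
mm.register((float,), abs)
assert v1 is not mm.version  # an update changed the version tag
```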

We still have lots of guards here.

+820: guard_nonnull_class(p3, ConstClass(Integer), descr=<Guard0x7f9c21d1a1d8>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104] +845: guard_nonnull_class(p5, ConstClass(Integer), descr=<Guard0x7f9c21d1a230>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104] +870: guard_not_invalidated(descr=<Guard0x7f9c21ccbf60>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104] +870: i107 = getfield_gc_i(p3, descr=<FieldS space.numbers.Integer.inst_value 8 pure>) +888: i108 = getfield_gc_i(p5, descr=<FieldS space.numbers.Integer.inst_value 8 pure>) +892: i109 = int_add(i107, i108) +902: guard_nonnull_class(p7, ConstClass(Integer), descr=<Guard0x7f9c21d1a288>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104] +935: i111 = getfield_gc_i(p7, descr=<FieldS space.numbers.Integer.inst_value 8 pure>) +939: i112 = int_sub(i109, i111) +956: i113 = int_sub(i112, i107) +970: i115 = int_lt(i113, 0) +974: guard_false(i115, descr=<Guard0x7f9c21ccbfa0>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, 
p92, p94, p96, p98, p100, p102, p104] +980: i116 = int_sub(i112, i108) +997: i118 = int_lt(i116, 0) +1001: guard_false(i118, descr=<Guard0x7f9c21d20020>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104] +1007: i119 = int_sub(i112, i111) +1017: i121 = int_lt(i119, 0) +1021: guard_false(i121, descr=<Guard0x7f9c21d20060>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104] +1027: i122 = int_le(i113, i116) +1034: guard_true(i122, descr=<Guard0x7f9c21d200a0>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104] +1040: i123 = int_le(i113, i119) +1047: guard_true(i123, descr=<Guard0x7f9c21d200e0>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104] +1053: leave_portal_frame(0)

And we are down to 7.1 seconds:

100.0% L3_16 100.0% png_load.lc:3 (line 5)
100.0% .. L5_39 100.0% png.lc:5
 99.5% .... L53_82 99.5% png.lc:53 (line 27)
  0.7% ...... L96_99 0.7% png.lc:96
  0.1% ...... L87_90 0.1% png.lc:87
 29.9% ...... L99_101 30.0% png.lc:99 (line 56)
 12.3% ........ L104_128 41.2% png.lc:104 (line 60)
  0.1% ...... L90_96 0.1% png.lc:90
  0.2% .... L139_156 0.2% png.lc:139
  0.1% ...... L156_158 25.0% png.lc:156
  0.1% .... L40_53 0.1% png.lc:40

Somewhat illogically, the time spent in the paeth predictor increased!

Conditional branching

Another trick to try is to remove the unnecessary branching by coalescing the conditionals. For this we need a tool; a select() function does fine.

@builtin
@signature(Boolean, Object, Object)
def select(cond, a, b):
    return a if cond == true else b

The optimizer will likely convert this into a conditional move, or so I hope. In any case it should eliminate the extra branches that would turn into guards. The paeth_predictor now looks like this:

paeth_predictor = (a, b, c):
    p = a + b - c
    pa = abs(p - a)
    pb = abs(p - b)
    pc = abs(p - c)
    return select(pa <= pb and pa <= pc, a,
        select(pb <= pc, b, c))

The results were perhaps illogical: this was not any faster. When I looked into the trace, the guards were still there. Maybe our implementation has the wrong shape? Let's do it in a way that cannot be interpreted wrong:

@builtin
@signature(Boolean, Object, Object)
def select(cond, a, b):
    return [b, a][cond == true]

Even then it does not have the desired effect. Forming a new bytecode instruction for this shouldn't have any effect either, because the call is constant-folded away anyway.
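For comparison, a select can also be made branch-free with pure arithmetic, which is roughly what a conditional move amounts to at the machine level. A sketch in Python, purely as an illustration of the shape we are after:

```python
def select(cond, a, b):
    # cond is coerced to exactly 0 or 1; no branch is taken
    # on the data, only straight-line arithmetic.
    cond = int(bool(cond))
    return a * cond + b * (1 - cond)

def paeth_predictor(a, b, c):
    p = a + b - c
    pa = abs(p - a)
    pb = abs(p - b)
    pc = abs(p - c)
    return select(pa <= pb and pa <= pc, a, select(pb <= pc, b, c))
```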

Reader loop

One problem in our reader loop is that we are doing a whole lot of work inside it. We are checking bounds and fetching values for the filter:

data = self.z.decompress(data)
for byte in data
    if self.new_scanline
        self.filter = decode_filters[byte]
        self.new_scanline = false
        continue
    if self.l == 0
        u = 0
        c = 0
        if self.x - self.stride >= self.l
            l = self.data[self.x - self.stride]
        else
            l = 0
    else
        u = self.data[self.x - self.y_stride]
        if self.x - self.stride >= self.l
            l = self.data[self.x - self.stride]
            c = self.data[self.x - self.y_stride - self.stride]
        else
            l = 0
            c = 0
    # run the filter and update the bounds
    self.data[self.x] = self.filter(byte, l, u, c)
    self.x += 1
    if self.x >= self.r
        self.l += self.y_stride
        self.r += self.y_stride
        self.new_scanline = true

We are doing all of that for every one of the 1 228 800 bytes. Yet we only have 480 scanlines!
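The byte count follows directly from the image dimensions:

```python
width, height, bytes_per_pixel = 640, 480, 4  # RGBA
stride = width * bytes_per_pixel  # bytes per scanline
total = stride * height
print(total)  # 1228800 filtered bytes, but only 480 scanlines
```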

We could instead do:

prior = self.prior
scanline = self.scanline
index = self.index
data = self.z.decompress(data)
while data.length > 0
    if index == 0
        self.filter = decode_filters[data[0]]
        data = data[1 .:]
    L = min(self.y_stride - index, data.length)
    for i in range(index, index+L)
        if i < self.stride
            l = 0
            c = 0
        else
            l = scanline[i - self.stride]
            c = prior[i - self.stride]
        u = prior[i]
        scanline[i] = data[i-index] + self.filter(l, u, c)
    if index + L >= self.y_stride
        prior = scanline
        scanline = self.data[self.next_scanline .: self.next_scanline + self.y_stride]
        index = 0
        self.next_scanline += self.y_stride
    else
        index += L
    data = data[L .:]
assert index < self.y_stride
self.prior = prior
self.scanline = scanline
self.index = index

To always have an empty prior scanline at the start, we initialize it to point to the second scanline. I also flipped the addition out of the filter, which helps a little bit.
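The shape of the restructured loop can be sketched in Python: consume the decompressed stream in scanline-sized chunks, so the per-byte state checks disappear and the per-scanline bookkeeping runs only 480 times. This is a simplified illustration that decodes only the "up" filter and assumes a whole image is available at once:

```python
def decode(stream, width_bytes, height):
    # 'up' filter only: recon[i] = raw[i] + prior[i], mod 256
    prior = bytearray(width_bytes)  # an all-zero prior scanline
    out = bytearray()
    pos = 0
    for _ in range(height):
        ftype = stream[pos]; pos += 1   # one filter-type byte per scanline
        assert ftype == 2, "sketch handles only the 'up' filter"
        raw = stream[pos:pos + width_bytes]; pos += width_bytes
        scanline = bytearray((raw[i] + prior[i]) & 0xFF
                             for i in range(width_bytes))
        out += scanline
        prior = scanline                # becomes the next line's 'up'
    return bytes(out)
```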

It now takes 3.6 seconds to load those 10 images.

100.0% L3_17 100.0% png_load.lc:3 (line 5)
100.0% .. L5_39 100.0% png.lc:5
 99.2% .... L53_85 99.2% png.lc:53 (line 231)
  0.2% ...... L90_91 0.2% png.lc:90
  1.5% ...... L96_97 1.5% png.lc:96
 27.2% ...... L99_110 27.4% png.lc:99 (line 192)
  0.5% ...... L93_94 0.5% png.lc:93
  0.3% .... L133_150 0.3% png.lc:133

ImageMagick as comparison

For comparison, if I copy the image 10 times and then run:

time convert *.png converted.jpg

It shows that the job takes 0.225s, and it's not only reading the PNG but also converting it into a JPEG. We are at least 10 to 20 times slower than the C program, which is a bit unacceptable.

Optimization of method lookup

We looked at the trace again on the pypy channel and found yet another lookup_method that was not elided. The likely reason was that the method table was not perceived as immutable. This resulted in the following silly pair of guards:

+600: p172 = getfield_gc_r(p1, descr=<FieldP space.customobject.CustomObject.inst_custom_interface 8 pure>) +618: p175 = call_r(ConstClass(Interface.lookup_method), p172, ConstPtr(ptr174), descr=<Callr 8 rr EF=4>) +710: guard_no_exception(descr=<Guard0x7f7b2d3a7d00>) [p0, p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, p12, p13, p14, p15, p16, p17, p18, p19, p20, p21, p22, p23, p24, p25, p26, p27, p28, p29, p30, p31, p32, p33, p34, p35, p36, p37, p38, p39, p40, p41, p42, p43, p44, p45, p46, p47, p48, p49, p50, p51, p52, p53, p54, p55, p56, p57, p58, p59, p60, p61, p62, p63, p64, p65, p66, p67, p68, p69, p70, p71, p72, p73, p74, p75, p76, p77, p78, p79, p80, p81, p82, p83, p84, p85, p86, p87, p88, p89, p90, p91, p92, p93, p94, p95, p96, p97, p98, p99, p100, p101, p102, p103, p104, p105, p106, p107, p114, p115, p116, p117, p118, p119, p120, p121, p122, p123, p124, p125, p126, p127, p128, p129, p130, p131, p132, p133, p134, p135, p136, p137, p138, p139, p140, p141, p142, p143, p144, p145, p146, p147, p148, p149, p150, p151, p152, p153, p154, p155, p156, p157, p158, p159, p160, p161, p162, p175, i171, i166] +725: guard_isnull(p175, descr=<Guard0x7f7b2d3ac020>) [p0, p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, p12, p13, p14, p15, p16, p17, p18, p19, p20, p21, p22, p23, p24, p25, p26, p27, p28, p29, p30, p31, p32, p33, p34, p35, p36, p37, p38, p39, p40, p41, p42, p43, p44, p45, p46, p47, p48, p49, p50, p51, p52, p53, p54, p55, p56, p57, p58, p59, p60, p61, p62, p63, p64, p65, p66, p67, p68, p69, p70, p71, p72, p73, p74, p75, p76, p77, p78, p79, p80, p81, p82, p83, p84, p85, p86, p87, p88, p89, p90, p91, p92, p93, p94, p95, p96, p97, p98, p99, p100, p101, p102, p103, p104, p105, p106, p107, p114, p115, p116, p117, p118, p119, p120, p121, p122, p123, p124, p125, p126, p127, p128, p129, p130, p131, p132, p133, p134, p135, p136, p137, p138, p139, p140, p141, p142, p143, p144, p145, p146, p147, p148, p149, p150, p151, p152, p153, p154, p155, p156, p157, p158, p159, p160, p161, 
p162, p175, i171, i166]

Giving the JIT a hint to promote the custom_interface into a constant solves this problem. The lookup_method can then be elided: the JIT figures out that it never returns anything other than null, and instead just retrieves the value from a storage object.

After the improvement the runtime has been shaved down to 2.7 seconds, and the paeth filter appears to be becoming a bottleneck again.

100.0% L3_17 100.0% png_load.lc:3 (line 5)
100.0% .. L5_39 100.0% png.lc:5
 98.7% .... L53_86 98.7% png.lc:53 (line 231)
  0.1% ...... L91_92 0.1% png.lc:91
  1.9% ...... L97_98 1.9% png.lc:97
 31.8% ...... L100_111 32.2% png.lc:100 (line 192)
  0.1% ...... L94_95 0.1% png.lc:94
  1.0% .... L134_151 1.0% png.lc:134
  0.1% ...... L151_153 14.3% png.lc:151

An insight into JIT

Perhaps I gained some intuition about the behavior of the metatracing JIT when I found out about the above problem.

When installing a JIT driver, I have to tell it the color of certain variables:

jitdriver = jit.JitDriver(
    greens=['pc', 'block', 'module', 'unit', 'excs', 'function'], #, 'ec'],
    reds=['frame'], # 'regv',
    virtualizables = ['frame'],
    get_printable_location=get_printable_location,
    get_unique_id=get_unique_id,
    is_recursive=True)

The need to specify red variables is an implementation detail here, but the green variables are needed by the JIT. The greens tell which variables are supposed to stay constant during JIT tracing.

When a trace meets a green variable, it places a guard for it and treats the program as if that variable were a constant. This information allows the tracing JIT to do deduction and optimize the trace.

Variables that can be deduced to be constants, such as constants found inside other constants, can be implicitly marked green without a guard. On the other hand, if a guard fails, the program has to branch and trace again.

In a well-tuned JIT, methods that are not changed dynamically are promoted into constants.
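The green/red split can be illustrated with a toy interpreter in plain Python. This only mimics the idea — the real mechanism lives in RPython's JitDriver — but it shows what a trace would be keyed on: the greens (the pc and the code object) identify the position in the program, while the reds (the frame state) flow through as variables:

```python
def interpret(code, frame):
    # greens: (pc, code) stay constant for any given trace;
    # reds: frame is the mutable state flowing through the trace.
    trace_keys = set()
    pc = 0
    while pc < len(code):
        trace_keys.add((pc, id(code)))  # what a JIT keys traces on
        op, arg = code[pc]
        if op == "add":
            frame["acc"] += arg
        elif op == "jump_if_lt":
            target, limit = arg
            if frame["acc"] < limit:
                pc = target
                continue
        pc += 1
    return frame, trace_keys

# a counting loop: add 1, jump back while acc < 5
code = [("add", 1), ("jump_if_lt", (0, 5))]
frame, keys = interpret(code, {"acc": 0})
assert frame["acc"] == 5
assert len(keys) == 2  # one potential trace entry per green position
```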

Shifting filter call out of the loop

It turns out that we can still squeeze some juice out by restructuring the program. Let's take this loop inside the reader loop:

for i in range(index, index+L)
    if i < self.stride
        l = 0
        c = 0
    else
        l = scanline[i - self.stride]
        c = prior[i - self.stride]
    u = prior[i]
    scanline[i] = data[i-index] + self.filter(l, u, c)

And replace it with this:

self.filter(prior, scanline, data[.: L], self.stride, index)

Then we write several smaller loops like this:

# average
((prior, scanline, data, stride, offset):
    for i in range(offset, offset+data.length)
        u = prior[i]
        if i >= stride
            u = scanline[i - stride] + u
        scanline[i] = data[i - offset] + u // 2
),
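As a sanity check, the same specialization can be sketched in Python. This is the PNG "average" filter, recon(x) = raw(x) + floor((left + up) / 2), applied to a whole scanline at a time; an illustration, not the Lever runtime code:

```python
def average_filter(prior, scanline, data, stride, offset):
    # prior and scanline are mutable byte buffers; data is the
    # slice of the decompressed stream belonging to this scanline.
    for i in range(offset, offset + len(data)):
        u = prior[i]
        if i >= stride:
            u = scanline[i - stride] + u  # left + up
        scanline[i] = (data[i - offset] + u // 2) & 0xFF
```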

This shaves the runtime down to 1.9 seconds. It means that one image takes about 200ms to load.

100.0% L3_17 100.0% png_load.lc:3 (line 5)
100.0% .. L5_39 100.0% png.lc:5
 98.7% .... L53_78 98.7% png.lc:53 (line 231, with 289)
  0.8% ...... L85_91 0.9% png.lc:85
  4.4% ...... L98_104 4.5% png.lc:98
 84.6% ...... L106_126 85.7% png.lc:106 (line 300)
  0.8% ...... L93_96 0.9% png.lc:93
  0.6% .... L149_166 0.6% png.lc:149
  0.2% ...... L166_168 33.3% png.lc:166

The statistics show, perhaps reassuringly, that the paeth filter constitutes over 80% of the runtime. If we somehow managed to shave it to a tenth of its time, the benchmark would take only about 550ms to run.
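That estimate is just Amdahl's-law arithmetic; with the paeth filter taken as roughly 80% of a 1.9-second run:

```python
total = 1.9          # seconds for the whole benchmark
paeth_share = 0.8    # approximate fraction of runtime in the paeth filter
speedup = 10.0       # hypothetical: paeth runs in a tenth of its time

rest = total * (1 - paeth_share)
new_total = rest + total * paeth_share / speedup
print(round(new_total, 2))  # ~0.53 s, in line with the ~550 ms estimate
```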

How much faster could it run?

Now there's an interesting question: if I optimized just the paeth-predictor loop, how fast could this program run?

If I temporarily disable the paeth loop, the app takes 0.3s to run. So if that part of the program didn't run at all, that would be the limit we cannot exceed by optimizing only that one function.

If we managed to optimize all of the inner loops, the vmprofile would then start to look like this:

100.0% L3_17 100.0% png_load.lc:3 (line 5)
 10.0% .. L30_41 10.0% png_load.lc:30
 86.0% .. L5_39 86.0% png.lc:5
  6.0% .... L154_171 7.0% png.lc:154
 76.0% .... L53_78 88.4% png.lc:53 (line 231, with 289)

The assembler

Motivated by the fast paeth filter, I decided to pick up the google/CPU-instructions dataset and implement an assembler that uses those tables.

If you are ever building an assembler, remember that it helps a whole lot to have a machine-readable table that tells the encoding and operands for every instruction.

While writing my own I also noticed that the encoding task becomes easy if you break it into two stages. In the first stage you fill in parameters such as:

rex byte

operand_op field (for operands that embed into the opcode fields)

modrm byte

sib byte

displacement offset

immediate values passed in

Scanning through the operands like this makes it easier to determine the size of the displacement field or where the SIB byte belongs.
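Once those fields are known, the byte packing itself is mechanical. A minimal sketch for the register-to-register case, with hypothetical helper names — the real assembler also has to handle memory operands, SIB bytes, and displacements:

```python
def rex(w, reg, rm):
    # REX prefix: 0100WR0B; R and B extend the reg/rm fields
    # to reach registers r8-r15.
    return 0x40 | (w << 3) | ((reg >> 3) << 2) | (rm >> 3)

def modrm(mod, reg, rm):
    # mod=0b11 means "rm is a register", not a memory operand.
    return (mod << 6) | ((reg & 7) << 3) | (rm & 7)

def mov_r64_r64(dst, src):
    # MOV r/m64, r64 is REX.W + 89 /r: the reg field holds the source.
    return bytes([rex(1, src, dst), 0x89, modrm(0b11, src, dst)])

# mov rax, rcx assembles to 48 89 c8
assert mov_r64_r64(0, 1) == bytes([0x48, 0x89, 0xC8])
```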

The CPU-instructions dataset also contains SSE and AVX instructions, which I expected would require encoding of a VEX or EVEX prefix. I haven't done that right yet, but I am sure to return to the subject sometime soon. In any case, proper use of those instructions requires feature-detection code.

After looking at the mnemonics for a while, I decided not to use them for selection and instead treat them as a form of documentation that describes the semantics of the instruction.

Returns and branches that have far and near versions share the same mnemonic, so I found it a bit risky to just let a program select an instruction from the list and run with it. To work around this problem, I gave a unique identifier to every instruction in the list:

5931 XCHG AX r16 [64] [L] Exchange r16 with AX.
5932 XCHG EAX r32 [64] [L] Exchange r32 with EAX.
5933 XCHG RAX r64 [64] Exchange r64 with RAX.
5934 XCHG m16 r16 [64] [L] Exchange r16 with word from r/m16.
5935 XCHG m32 r32 [64] [L] Exchange r32 with doubleword from r/m32.
5936 XCHG m64 r64 [64] Exchange r64 with quadword from r/m64.
5937 XCHG m8 r8 [64] [L] Exchange r8 (byte register) with byte from r/m8.
5938 XCHG m8 r8 [64] Exchange r8 (byte register) with byte from r/m8.
5939 XCHG r16 AX [64] [L] Exchange AX with r16.
5940 XCHG r16 m16 [64] [L] Exchange word from r/m16 with r16.
5941 XCHG r16 r16 [64] [L] Exchange word from r/m16 with r16.
5942 XCHG r16 r16 [64] [L] Exchange r16 with word from r/m16.
5943 XCHG r32 EAX [64] [L] Exchange EAX with r32.
5944 XCHG r32 m32 [64] [L] Exchange doubleword from r/m32 with r32.
5945 XCHG r32 r32 [64] [L] Exchange doubleword from r/m32 with r32.
5946 XCHG r32 r32 [64] [L] Exchange r32 with doubleword from r/m32.
5947 XCHG r64 RAX [64] Exchange RAX with r64.
5948 XCHG r64 m64 [64] Exchange quadword from r/m64 with r64.
5949 XCHG r64 r64 [64] Exchange r64 with quadword from r/m64.
5950 XCHG r64 r64 [64] Exchange quadword from r/m64 with r64.
5951 XCHG r8 m8 [64] [L] Exchange byte from r/m8 with r8 (byte register).
5952 XCHG r8 m8 [64] Exchange byte from r/m8 with r8 (byte register).
5953 XCHG r8 r8 [64] [L] Exchange r8 (byte register) with byte from r/m8.
5954 XCHG r8 r8 [64] [L] Exchange byte from r/m8 with r8 (byte register).
5955 XCHG r8 r8 [64] Exchange r8 (byte register) with byte from r/m8.
5956 XCHG r8 r8 [64] Exchange byte from r/m8 with r8 (byte register).

Many of these are the same opcode but with different prefixes or flags. I don't have a method to calculate the size of each encoding, and I don't have a heuristic for selecting between them.

To represent addresses in memory operands, I picked the following format:

Address(type, offset, base=-1, index=-1, scale=1)

You can have an address without the base register if the additional 4 bytes from the offset field are tolerable.

Now I can finally emit instructions like this:

# 1006 LEA r64 m [64] Store effective address for m in register r64.
emit(1006, Register(i64, 11), Address(i64, 0, 10, 13)) # ABS on R11
# 1127 MOV r64 r64 [64] Move r/m64 to r64.
emit(1127, Register(i64, 13), Register(i64, 11))
# 1289 NEG r64 [64] Two's complement negate r/m64.
emit(1289, Register(i64, 11))
# 335 CMOVL r64 m64 [64] Move if less (SF≠ OF).
emit( 335, Register(i64, 11), Register(i64, 13))

I linked the loops by running the assembler twice and using ordinary variables for them:

emit( 878, Immediate(i8, loop_label - jl1_label))
jl1_label := output.length

As long as the instruction numbers and lengths do not change, the second run reaches the fixed point.
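The fixed-point idea can be illustrated with a toy two-pass assembler in Python (the opcodes and label mechanics here are invented for illustration; the real assembler uses instruction UIDs, not raw bytes):

```python
def assemble(labels):
    # labels: label positions recorded on the previous pass ({} on pass one)
    out = bytearray()
    new_labels = {}
    out += b"\x90\x90"             # two NOPs standing in for setup code
    new_labels["loop"] = len(out)  # record where the loop starts
    out += b"\x90"                 # loop body stand-in
    # JMP rel8 back to "loop"; on the first pass the target is a guess (0)
    rel = labels.get("loop", 0) - (len(out) + 2)
    out += bytes([0xEB, rel & 0xFF])
    return bytes(out), new_labels

out1, labels = assemble({})      # pass 1: wrong displacement, right labels
out2, labels = assemble(labels)  # pass 2: displacement now correct
out3, labels = assemble(labels)  # pass 3: nothing changes any more
```

Since the label positions only depend on instruction lengths, the second pass already produces the fixed point when no lengths change between passes.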

The Paeth decoder, written in assembly like the above, looks like this:

assert platform.arch == "x86_64"
    "It just might not work on x86"
assert platform.name.startswith("linux")
    "Also this depends on the SystemV calling conventions."
arg_0 = Register(i64, 7) # RDI
arg_1 = Register(i64, 6) # RSI
arg_2 = Register(i64, 2) # RDX
arg_3 = Register(i64, 1) # RCX
arg_4 = Register(i64, 8)
arg_5 = Register(i64, 9)
loop_label = 0
exit_label = 0
ja_label = 0
jl1_label = 0
jl2_label = 0
assemble = ():
    # 1811 PUSH r64 [64] Push r64.
    # we need additional register to clobber.
    emit(1811, Register(i64, 3))  # x .r3
    emit(1811, Register(i64, 10)) # .r10
    emit(1811, Register(i64, 11)) # .r11
    emit(1811, Register(i64, 12)) # .r12
    emit(1811, Register(i64, 13)) # a .r13
    emit(1811, Register(i64, 14)) # b .r14
    emit(1811, Register(i64, 15)) # c .r15
    # i = 0 .r0
    # 5973 XOR m64 r64 [64] r/m64 XOR r64.
    emit(5973, Register(i64, 0), Register(i64, 0))
    emit(5973, Register(i64, 13), Register(i64, 13))
    emit(5973, Register(i64, 15), Register(i64, 15))
    # j = offset .arg_4(8)
    # 1289 NEG r64 [64] Two's complement negate r/m64.
    # 74 ADD r64 r64 [64] Add r64 to r/m64.
    # k = j - stride .r1
    emit(1289, Register(i64, 1))
    emit( 74, Register(i64, 1), Register(i64, 8))
    # 878 JL rel8 [64] [L] Jump short if less (SF≠ OF).
    emit( 878, Immediate(i8, loop_label - jl1_label))
    jl1_label := output.length
    emit(1254, Register(i64, 13), Address(i8, 0, 7, 1)) # [prior .r7 + k .r1]
    emit(1254, Register(i64, 15), Address(i8, 0, 6, 1)) # [scanline .r6 + k .r1]
    # loop:
    loop_label := output.length
    # 0475 CMP m64 r64 [64] Compare r64 with r/m64.
    # if i >= length goto exit
    emit( 475, Register(i64, 0), arg_5)
    # 0862 JAE rel32 [64] [L] Jump short if above or equal (CF=0).
    emit( 862, Immediate(i32, exit_label - ja_label))
    ja_label := output.length
    ## Test
    # emit( 54, Address(i64, 0, arg_0.index), Immediate(i32, 1))
    # 1254 MOVZX r64 m8 [64] Move byte to quadword, zero-extension.
    # x = input[i]
    emit(1254, Register(i64, 14), Address(i8, 0, 7, 8)) # [prior .r7 + j .r8]
    # Starting paeth predictor here. a.r13, b.r14, c.r15
    # 1006 LEA r64 m [64] Store effective address for m in register r64.
    # 2327 SUB r64 m64 [64] Subtract r/m64 from r64.
    # p = a + b - c  p.r10
    emit(1006, Register(i64, 10), Address(i64, 0, 13, 14))
    emit(2327, Register(i64, 10), Register(i64, 15))
    emit(1289, Register(i64, 10)) # negate to use LEA again.
    # 335 CMOVL r64 m64 [64] Move if less (SF≠ OF).
    # 1127 MOV r64 r64 [64] Move r/m64 to r64.
    # pa = abs(a - p)  p.r10 a.r13 pa.r11
    emit(1127, Register(i64, 3), Register(i64, 13))
    emit(1006, Register(i64, 11), Address(i64, 0, 10, 13)) # The a.r13 is free now.
    # ABS on R11
    emit(1127, Register(i64, 13), Register(i64, 11))
    emit(1289, Register(i64, 11)) # negate to use CMOVL
    emit( 335, Register(i64, 11), Register(i64, 13))
    # pb = abs(b - p)  p.r10 a.r13 pa.r12
    emit(1006, Register(i64, 12), Address(i64, 0, 10, 14))
    # ABS on R12
    emit(1127, Register(i64, 13), Register(i64, 12))
    emit(1289, Register(i64, 12)) # negate to use CMOVL
    emit( 335, Register(i64, 12), Register(i64, 13))
    # Now we want to compare and CMOVL if relevant.
    emit( 475, Register(i64, 12), Register(i64, 11))
    emit( 335, Register(i64, 11), Register(i64, 12))
    emit( 335, Register(i64, 3), Register(i64, 14))
    # pc = abs(c - p)  p.r10 a.r13 pa.r11
    emit(1006, Register(i64, 12), Address(i64, 0, 10, 15))
    # ABS on R12
    emit(1127, Register(i64, 13), Register(i64, 12))
    emit(1289, Register(i64, 12)) # negate to use CMOVL
    emit( 335, Register(i64, 12), Register(i64, 13))
    # Now we want to compare and CMOVL if relevant.
    emit( 475, Register(i64, 12), Register(i64, 11))
    emit( 335, Register(i64, 11), Register(i64, 12))
    emit( 335, Register(i64, 3), Register(i64, 15))
    # Done with paeth, next we add.
    # emit(1254, Register(i64, 3), Address(i8, 0, 2, 0)) # [input .r2 + i .r0]
    # 78 ADD r8 m8 [64] [L] Add r/m8 to r8.
    emit( 78, Register(i8, 3), Address(i8, 0, 2, 0)) # [input .r2 + i .r0]
    # 1097 MOV m8 r8 [64] [L] Move r8 to r/m8.
    # scanline[j] = x
    emit(1097, Address(i8, 0, 6, 8), Register(i8, 3)) # [scanline .r6 + j .r8]
    # 0072 ADD r64 imm8 [64] Add sign-extended imm8 to r/m64.
    # i += 1
    emit( 72, Register(i64, 0), Immediate(i8, 1))
    # j += 1
    emit( 72, Register(i64, 8), Immediate(i8, 1))
    # clean a and c
    emit(5973, Register(i64, 13), Register(i64, 13))
    emit(5973, Register(i64, 15), Register(i64, 15))
    # k += 1
    emit( 72, Register(i64, 1), Immediate(i8, 1))
    # 878 JL rel8 [64] [L] Jump short if less (SF≠ OF).
    emit( 878, Immediate(i8, loop_label - jl2_label))
    jl2_label := output.length
    emit(1254, Register(i64, 13), Address(i8, 0, 7, 1)) # [prior .r7 + k .r1]
    emit(1254, Register(i64, 15), Address(i8, 0, 6, 1)) # [scanline .r6 + k .r1]
    # jump loop
    # 0886 JMP rel32 [64] [L] Jump short, RIP = RIP + 8-bit displacement sign extended to 64-bits
    emit( 886, Immediate(i32, loop_label - exit_label))
    # exit:
    exit_label := output.length
    # Restore .r3 we clobbered.
    # 1638 POP r64 [64] Pop top of stack into r64; increment stack pointer. Cannot encode 32-bit operand size.
    emit(1638, Register(i64, 15))
    emit(1638, Register(i64, 14))
    emit(1638, Register(i64, 13))
    emit(1638, Register(i64, 12))
    emit(1638, Register(i64, 11))
    emit(1638, Register(i64, 10))
    emit(1638, Register(i64, 3))
    # return
    emit(1901) # RET NEAR
emit = (uid, args...):
    output.extend(asm.encode_ins(uid, args))
output = []
assemble()
output = []
assemble()

At the point when I was going to write the actual algorithm the loop was supposed to run, I noticed that I had nearly run out of available registers.

From here it's easy to run this in place of the original Paeth filter. We copy the assembler output into an executable memory region and apply W^X to it. I already have a utility for that:

buf = mman.Asmbuf(4096)
buf.memcpy(Uint8Array(output))
buf.finalize()

Next we need a function handle that we can call. We get one from the FFI:

c_type = ffi.cfunc(ffi.int, [
    ffi.voidp, ffi.voidp, ffi.voidp,     # prior, scanline, data
    ffi.size_t, ffi.size_t, ffi.size_t]) # stride, offset, length
c_func = ffi.cast(buf, c_type)

The filter array requires a little rewrite so that the length field is exposed as above, but otherwise this already works and we can hot-patch the assembled program into the png module:

import png
png.decode_filters[4] = c_func
main_png()

Now we can run the original benchmark.

Assembler results

With the monkey-patched assembly module, our original benchmark now takes 0.6 seconds to run, and the vmprof report closely resembles what removing the Paeth decoder would look like:

100.0% L5_200 100.0% fast_paeth.lc:5
100.0% .. L203_217 100.0% fast_paeth.lc:203
6.6% .... L230_241 6.6% fast_paeth.lc:230
92.3% .... L5_39 92.3% png.lc:5
11.0% ...... L149_166 11.9% png.lc:149
3.3% ........ L166_168 30.0% png.lc:166
79.1% ...... L53_78 85.7% png.lc:53 (line 231, with 289)
1.1% ........ L93_96 1.4% png.lc:93
24.2% ........ L98_104 30.6% png.lc:98
4.4% ........ L85_91 5.6% png.lc:85

I did double-check that the Paeth decoder keeps working. I made a hexadecimal dump from the image and compared the data with the optimized version. Both of them look like this:

35 51 45 ff 36 46 3b ff 1b 1e 19 ff 18 1b 16 ff 33 4f 43 ff 27 3a 2f ff 1b 1e 19 ff 18 1b 16 ff 2c 4a 40 ff 2e 4c 42 ff 2f 4d 43 ff 34 4f 46 ff 27 45 3b ff 29 47 3d ff 2a 48 3e ff 28 48 3a ff 27 47 3c ff 26 44 3a ff 25 43 39 ff 23 43 39 ff

You can tell that this data is at least plausibly correct, because the full alpha channel separates the scanlines into columns of ff bytes.

Closer inspection shows that the assembled version has some errors. It still runs through all the pixels, so I believe the problem is something subtle:

I won't use the assembled version just yet, so I'll leave the fix for later iterations.

Finally I also made a Vulkan demo that uses the JIT-optimized png module. Here's the screenshot:

Hand-written assembly is not practical

The assembler optimization is very successful considering that we don't even store any compiled executables here, and we are still 4 times faster than within the JIT.

The problem is that the assembly module here cannot adjust for different calling conventions. It only works on Linux and on x86_64. There are other problems:

We have two versions to maintain. And if we discard the old version, then we cannot return to JIT-evaluating the code in case that is more practical for some usecase.

We had to rewrite the assembler version from scratch when optimizing the program this way.

We cannot access all GC records from the assembly. Only the ones that we could naturally pass into C libraries.

The produced assembly code is calling-convention dependent and fixed to an operating system.

The practical solution would be to run type inference on the original program and then compile it down to machine code. That is the reason why I wrote the assembler in the first place, but it is also a subject for a second blog post.

Driving it through GLSL filter in Vulkan

Finally I made a Vulkan application that uses the PNG loader. The full source code can be found at samples/warpgpu/2_images.lc.

After the image has been loaded, we upload it to the GPU by memcopying into mapped location like this:

image.mem = gpu.mem.associate(image, [
    "HOST_VISIBLE_BIT",
    "HOST_COHERENT_BIT"
])
data = image.mem.map(ffi.byte, image.offset, image.size)
ffi.memcpy(data, png_image.data, png_image.data.length)

Of course, this is a bit wasteful, because the PNG loader wouldn't really need anything else than the two latest scanlines. It could write into the scanline and the image simultaneously to load the image without ever holding the fully decompressed image in memory.

It was nice to start with the simple usecase first though.

Finally I checked the image in the GLSL output. It produces the effect by shifting the texture coordinates horizontally using output from a Voronoi noise generator:

Extra: Let's try SSE & SIMD instructions!

At first I thought I wouldn't go this far, but I still had Sunday to go and everyone reading this post would expect me to get into SIMD. So let's go!

Feature detection is mandatory if you intend to use SIMD, so the first thing was to check what my CPU actually supports by running cat /proc/cpuinfo:

vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz
stepping : 7
microcode : 0x70a
cpu MHz : 2003.000
cache size : 6144 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm tpr_shadow vnmi flexpriority dtherm
bugs :
bogomips : 5666.40
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

It seems that I can use all the instructions up to SSE4.1. Unfortunately plain SSE doesn't require the VEX prefix, so I don't have a way to test whether I would get the VEX encoding right.

It's no wonder that stereotypical geeks are charismatic white males with fancy beards and no girlfriends! You would need very big wads of cash to stay updated on all the coolest hardware that pushes out of the factories at a constant feed.

Quick look into the instruction table tells that I have 565 SIMD operations available. When I discard the operations that work on floating point values, I end up with 361 operations to choose from.

Vectorizing the program

The first step is to look at the original program and see what we could do for it. To make this easier, we break it into extended basic blocks:

t = a + b
p = t - c
ta = p - a
pa = abs ta
tb = p - b
pb = abs tb
tc = p - c
pc = abs tc
c1 = le pa pb
c2 = le pa pc
c3 = and c1 c2
return a if c3
c4 = le pb pc
return b if c4
return c
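For reference, the same straight-line form reads like this in plain Python (this mirrors the standard PNG Paeth predictor, not any particular file in this project):

```python
def paeth_predictor(a, b, c):
    p = a + b - c        # t = a + b; p = t - c
    pa = abs(p - a)      # distance to the left neighbour
    pb = abs(p - b)      # distance to the upper neighbour
    pc = abs(p - c)      # distance to the upper-left neighbour
    if pa <= pb and pa <= pc:
        return a
    elif pb <= pc:
        return b
    return c
```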

That is surprisingly many operations after all. What we are about to do is called auto-vectorization in compilers. I guess what we are specifically about to do is called loop vectorization.

One important thing to note is that the algorithm is not defined on bytes. This means that we might get the wrong result if we keep the bytes packed in our registers. We may have to unpack, so we are going to need instructions for that. We are also going to need instructions to load values into the xmm registers. I think the following is sufficient:

PACKSSDW PACKSSDW PACKSSDW PACKSSDW PACKSSWB PACKSSWB PACKSSWB PACKSSWB PACKUSWB PACKUSWB PACKUSWB PACKUSWB PUNPCKHBW PUNPCKHBW PUNPCKHBW PUNPCKHBW PUNPCKHDQ PUNPCKHDQ PUNPCKHDQ PUNPCKHDQ PUNPCKHQDQ PUNPCKHQDQ PUNPCKHWD PUNPCKHWD PUNPCKHWD PUNPCKHWD PUNPCKLBW PUNPCKLBW PUNPCKLBW PUNPCKLBW PUNPCKLDQ PUNPCKLDQ PUNPCKLDQ PUNPCKLDQ PUNPCKLQDQ PUNPCKLQDQ PUNPCKLWD PUNPCKLWD PUNPCKLWD PUNPCKLWD MOVDQA MOVDQU
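The concern about packed bytes can be made concrete. Below is a hypothetical Python comparison between the predictor computed with full integers and one where the intermediate p is wrapped to a byte (both functions are illustrative, written for this demonstration):

```python
def paeth_full(a, b, c):
    # the predictor as specified: intermediates are unbounded integers
    p = a + b - c
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    if pa <= pb and pa <= pc: return a
    if pb <= pc: return b
    return c

def paeth_bytes(a, b, c):
    # same logic, but the intermediate p wraps around at 256
    p = (a + b - c) & 0xFF
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    if pa <= pb and pa <= pc: return a
    if pb <= pc: return b
    return c
```

For a = 100, b = 200, c = 10 the true intermediate is 290, which no longer fits in a byte, and the two versions pick different neighbours. This is why the bytes are unpacked to words before the arithmetic.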

Next I looked up what I could find in our instruction table of valid operations. I found the following:

t = a + b        PADDB, PADDD, PADDQ, PADDSB, PADDSW, PADDUSB, PADDUSW, PADDW, PALIGNR [m64, m128]
p = t - c        PSUBB, PSUBD, PSUBQ, PSUBSB, PSUBSW, PSUBUSB, PSUBUSW, PSUBW
ta = p - a
pa = abs ta      PABSB, PABSD, PABSW [m64, m128]
tb = p - b
pb = abs tb
tc = p - c
pc = abs tc
c1 = le pa pb    PCMPGTB, PCMPGTW, PCMPGTD [m64, m128]
c2 = le pa pc
c3 = and c1 c2   PAND, PANDN [m64, m128]
return a if c3
c4 = le pb pc
return b if c4
return c

It appears that we have m128 forms of everything, so we can use the SSE2 instructions everywhere here. We would also have PHADDW to add packed short integers horizontally.

The SIMD instructions do not have conditional operations, so we have to use a trick instead.

return a if c3   PAND, PANDN, POR, PXOR [m64, m128]
c4 = le pb pc
return b if c4
return c
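The trick is the classic branchless select: a compare instruction produces an all-ones or all-zeros mask per lane, and AND/ANDN/OR combine the two candidates. A Python sketch over a single 128-bit lane (illustrative model, not real intrinsics):

```python
MASK = (1 << 128) - 1   # one 128-bit lane, as in an XMM register

def select(cond, x, y):
    # PAND keeps x where cond is all-ones; ANDing y with NOT(cond)
    # keeps y where cond is all-zeros; POR merges the two halves.
    return (x & cond) | (y & ~cond & MASK)
```

With cond produced by a packed compare, every lane independently picks x or y without a branch.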

Here we have everything we need to carry this out now. The final program:

result = 1168 MOVDQU [input + i - offset]
a = 1164 MOVDQU [scanline + i - stride]
b = 1168 MOVDQU [prior + i]
c = 1164 MOVDQU [prior + i - stride]
al = 1786 PUNPCKLBW a
bl = 1786 PUNPCKLBW b
cl = 1786 PUNPCKLBW c
ah = 1772 PUNPCKHBW a
bh = 1772 PUNPCKHBW b
ch = 1772 PUNPCKHBW c
pal = 1165 MOVDQA al
pah = 1165 MOVDQA ah
pal += 1395 PADDD bl
pah += 1395 PADDD bh
pal -= 1742 PSUBD al
pah -= 1742 PSUBD ah
pbl = 1165 MOVDQA pal
pbh = 1165 MOVDQA pah
pcl = 1165 MOVDQA pal
pch = 1165 MOVDQA pah
pal -= 1742 PSUBD al
pbl -= 1742 PSUBD bl
pcl -= 1742 PSUBD cl
pah -= 1742 PSUBD ah
pbh -= 1742 PSUBD bh
pch -= 1742 PSUBD ch
pal = 1369 PABSD pal
pbl = 1369 PABSD pbl
pcl = 1369 PABSD pcl
pah = 1369 PABSD pah
pbh = 1369 PABSD pbh
pch = 1369 PABSD pch
c1l = 1472 PCMPGTD pbl pal
c2l = 1472 PCMPGTD pcl pal
c4l = 1472 PCMPGTD pcl pbl
c1h = 1472 PCMPGTD pbh pah
c2h = 1472 PCMPGTD pch pah
c4h = 1472 PCMPGTD pch pbh
c3l = 1427 PAND c1l c2l
c3h = 1427 PAND c1h c2h
bl = 1427 PAND bl c4l
bh = 1427 PAND bh c4h
cl = 1431 PANDN cl, c4l
ch = 1431 PANDN ch, c4h
bl = 1650 POR bl, cl
bh = 1650 POR bh, ch
al = 1427 PAND al, c3l
ah = 1427 PAND ah, c3h
bl = 1431 PANDN bl, c3l
bh = 1431 PANDN bh, c3h
al = 1650 POR al, bl
ah = 1650 POR ah, bh
delta = 1387 PACKUSWB al, ah
result += 1391 PADDB delta
1167 MOVDQU [scanline], result

It's ridiculous! 54 operations that run on 16 bytes of data at once. Next we should allocate registers and produce the program. On Linux we are free to use the whole range of XMM registers without having to save them, so we do so.

simd_paeth_assembly = (emit, input_addr, a_addr, b_addr, c_addr, scanline_addr):
    xmm0 = Register(m128, 0)
    xmm1 = Register(m128, 1)
    xmm2 = Register(m128, 2)
    xmm3 = Register(m128, 3)
    xmm4 = Register(m128, 4)
    xmm5 = Register(m128, 5)
    xmm6 = Register(m128, 6)
    xmm7 = Register(m128, 7)
    xmm8 = Register(m128, 8)
    xmm9 = Register(m128, 9)
    xmm10 = Register(m128, 10)
    xmm11 = Register(m128, 11)
    xmm12 = Register(m128, 12)
    xmm13 = Register(m128, 13)
    xmm14 = Register(m128, 14)
    xmm15 = Register(m128, 15)
    # register allocation
    result = xmm0
    al = xmm1
    bl = xmm2
    cl = xmm3
    ah = xmm4
    bh = xmm5
    ch = xmm6
    pal = xmm7
    pbl = xmm8
    pcl = xmm9
    pah = xmm10
    pbh = xmm11
    pch = xmm12
    # result = 1164 MOVDQU [input + i - offset]
    emit(1168, result, input_addr)
    # a = 1168 MOVDQU [scanline + i - stride]
    emit(1164, al, a_addr)
    # b = 1168 MOVDQU [prior + i]
    emit(1168, bl, b_addr)
    # c = 1168 MOVDQU [prior + i - stride]
    emit(1164, cl, c_addr)
    # ah = 1772 PUNPCKHBW a
    # bh = 1772 PUNPCKHBW b
    # ch = 1772 PUNPCKHBW c
    emit(1772, ah, al)
    emit(1772, bh, bl)
    emit(1772, ch, cl)
    # al = 1786 PUNPCKLBW a
    # bl = 1786 PUNPCKLBW b
    # cl = 1786 PUNPCKLBW c
    emit(1786, al, al)
    emit(1786, bl, bl)
    emit(1786, cl, cl)
    # pal = 1165 MOVDQA al
    # pah = 1165 MOVDQA ah
    emit(1165, pal, al)
    emit(1165, pah, ah)
    # pal += 1395 PADDD bl
    # pah += 1395 PADDD bh
    emit(1395, pal, bl)
    emit(1395, pah, bh)
    # pal -= 1742 PSUBD al
    # pah -= 1742 PSUBD ah
    emit(1742, pal, al)
    emit(1742, pah, ah)
    # pbl = 1165 MOVDQA pal
    # pbh = 1165 MOVDQA pah
    # pcl = 1165 MOVDQA pal
    # pch = 1165 MOVDQA pah
    emit(1165, pbl, pal)
    emit(1165, pbh, pah)
    emit(1165, pcl, pcl)
    emit(1165, pch, pch)
    # pal -= 1742 PSUBD al
    # pbl -= 1742 PSUBD bl
    # pcl -= 1742 PSUBD cl
    # pah -= 1742 PSUBD ah
    # pbh -= 1742 PSUBD bh
    # pch -= 1742 PSUBD ch
    emit(1742, pal, al)
    emit(1742, pbl, bl)
    emit(1742, pcl, cl)
    emit(1742, pah, ah)
    emit(1742, pbh, bh)
    emit(1742, pch, ch)
    # pal = 1369 PABSD pal
    # pbl = 1369 PABSD pbl
    # pcl = 1369 PABSD pcl
    # pah = 1369 PABSD pah
    # pbh = 1369 PABSD pbh
    # pch = 1369 PABSD pch
    emit(1369, pal, pal)
    emit(1369, pbl, pbl)
    emit(1369, pcl, pcl)
    emit(1369, pah, pah)
    emit(1369, pbh, pbh)
    emit(1369, pch, pch)
    # We would need at least 4 registers more, so
    # we need to figure out which ones we can reuse.
    c1l = xmm13
    c1h = xmm14
    c2l = xmm15
    # If we do some reordering here, we can
    # reuse pal
    # c1l = 1472 PCMPGTD pbl pal
    # c1h = 1472 PCMPGTD pbh pah
    # c2l = 1472 PCMPGTD pcl pal
    # c2h = 1472 PCMPGTD pch pah
    emit(1164, c1l, pbl)
    emit(1472, c1l, pal)
    emit(1164, c2l, pcl)
    emit(1472, c2l, pal)
    c2h = pal
    emit(1164, c2h, pch)
    emit(1472, c2h, pah)
    emit(1164, c1h, pbh)
    emit(1472, c1h, pah)
    # c4l = 1472 PCMPGTD pcl pbl
    # c4h = 1472 PCMPGTD pch pbh
    # We can reuse pcl/pch for c4l, c4h
    c4l = pcl # emit(1164, c4l, pcl)
    c4h = pch # emit(1164, c4h, pch)
    emit(1472, c4l, pbl)
    emit(1472, c4h, pbh)
    # c3l = 1427 PAND c1l c2l
    # c3h = 1427 PAND c1h c2h
    # We can reuse c1* for c3*
    c3l = c1l
    c3h = c1h
    emit(1427, c3l, c2l)
    emit(1427, c3h, c2h)
    # bl = 1427 PAND bl c4l
    # bh = 1427 PAND bh c4h
    emit(1427, bl, c4l)
    emit(1427, bh, c4h)
    # cl = 1431 PANDN cl, c4l
    # ch = 1431 PANDN ch, c4h
    emit(1431, cl, c4l)
    emit(1431, ch, c4h)
    # bl = 1650 POR bl, cl
    # bh = 1650 POR bh, ch
    emit(1650, bl, cl)
    emit(1650, bh, ch)
    # al = 1427 PAND al, c3l
    # ah = 1427 PAND ah, c3h
    emit(1427, al, c3l)
    emit(1427, ah, c3h)
    # bl = 1431 PANDN bl, c3l
    # bh = 1431 PANDN bh, c3h
    emit(1431, bl, c3l)
    emit(1431, bh, c3h)
    # al = 1650 POR al, bl
    # ah = 1650 POR ah, bh
    emit(1650, al, bl)
    emit(1650, ah, bh)
    # delta = 1387 PACKUSWB al, ah
    delta = al
    emit(1387, delta, ah)
    # result += 1391 PADDB delta
    emit(1391, result, delta)
    # 1168 MOVDQU [scanline + i], result
    emit(1167, scanline_addr, result)
    return result

The dump of the above program is 288 bytes long. A quick look with ndisasm revealed that the assembler works. Now we have to install the above thing into a loop to have it run. We can reuse most of our old loop:

assemble = ():
    # 1811 PUSH r64 [64] Push r64.
    # we no longer clobber any registers we should save.
    # i = 0 .r0
    # 5973 XOR m64 r64 [64] r/m64 XOR r64.
    emit(5973, Register(i64, 0), Register(i64, 0))
    # j = offset .arg_4(8)
    # 1289 NEG r64 [64] Two's complement negate r/m64.
    # 74 ADD r64 r64 [64] Add r64 to r/m64.
    # k = j - stride .r1
    emit(1289, Register(i64, 1))
    emit( 74, Register(i64, 1), Register(i64, 8))
    # loop:
    loop_label := output.length
    # 0475 CMP m64 r64 [64] Compare r64 with r/m64.
    # if i >= length goto exit
    emit( 475, Register(i64, 0), arg_5)
    # 0862 JAE rel32 [64] [L] Jump short if above or equal (CF=0).
    emit( 862, Immediate(i32, exit_label - ja_label))
    ja_label := output.length
    simd_paeth_assembly(emit,
        Address(m128, 0, 2, 0), # input = [input .r2 + i .r0]
        Address(m128, 0, 6, 1), # a = [scanline .r6 + k .r1]
        Address(m128, 0, 7, 8), # b = [prior .r7 + j .r8]
        Address(m128, 0, 7, 1), # c = [prior .r7 + k .r1]
        Address(m128, 0, 6, 8)) # output = [scanline .r6 + j .r8]
    # i, j, k += 16
    emit( 72, Register(i64, 0), Immediate(i8, 16))
    emit( 72, Register(i64, 8), Immediate(i8, 16))
    emit( 72, Register(i64, 1), Immediate(i8, 16))
    # jump loop
    # 0886 JMP rel32 [64] [L] Jump short, RIP = RIP + 8-bit displacement sign extended to 64-bits
    emit( 886, Immediate(i32, loop_label - exit_label))
    # exit:
    exit_label := output.length
    # return
    emit(1901) # RET NEAR

We can't use this new implementation in isolation, because it churns through full 16-byte chunks in one swoop. We need the JIT version to handle the beginning and the end, so the two have to be combined:

make_new_filter = (optimized, ordinary):
    return (prior, scanline, input, stride, offset, length):
        # max(0, offset - stride) + 16 must be filled.
        fast_offset = max(max(0, offset-stride) + 16, offset)
        fast_index = fast_offset - offset
        remain_length = length - fast_index
        last_length = remain_length & 16
        fast_length = remain_length - last_length
        last_index = length - last_length
        last_offset = offset + last_index
        ordinary(prior, scanline, input, stride, offset, fast_index)
        optimized(prior, scanline, input[fast_index .:], stride, fast_offset, fast_length)
        ordinary(prior, scanline, input[last_index .:], stride, last_offset, last_length)

That was already a bit tricky to write.
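The general shape of the split — a scalar head so the vector body has decoded history to read, a body measured in whole 16-byte chunks, and a scalar tail for the leftovers — can be sketched like this (a simplified, hypothetical helper, not the code above):

```python
def split_lengths(length, chunk=16):
    # scalar head: decode the first `chunk` bytes ordinarily, so the
    # SIMD body always has `chunk` bytes of finished output behind it
    head = min(chunk, length)
    # body: as many whole chunks as fit in what remains
    body = (length - head) // chunk * chunk
    # tail: whatever is left over, decoded ordinarily again
    tail = length - head - body
    return head, body, tail
```

The three parts must always sum back to the original length, and the body must stay a multiple of the chunk size — exactly the invariants the `& 16` bug in the version above violates (it should be `& 15`).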

Debugging

When we run the code, we notice it produces wrong output. Here are the first two scanlines of the JIT-only version:

7f 83 82 ff 7c 80 7f ff 7f 83 82 ff 7b 7f 7e ff ... 48 52 2f ff 4f 59 36 ff 50 5a 37 ff 4e 58 35 ff
7e 82 81 ff 7b 7f 7e ff 7e 82 81 ff 7a 7e 7d ff ... 50 5a 37 ff 57 61 3e ff 57 61 3e ff 52 5c 39 ff

And here's our JIT+SIMD version:

7f 83 82 ff 7c 80 7f ff 7f 83 82 ff 7b 7f 7e ff ... 48 52 2f ff 4f 59 36 ff 50 5a 37 ff 4e 58 35 ff
7e 82 81 ff 7b 7f 7e ff 7e 82 81 ff 7a 7e 7d ff ... 07 07 07 00 0e 0e 0e 00 0e 0e 0e 00 09 09 09 00

It looks a lot like something is off in our Paeth. The first line looks okay, but actually a different filter runs there; our filter runs on the second row. Fortunately we can define a software breakpoint.

# 852 INT 3 [64] [L] Interrupt 3 - trap to debugger.
emit( 852, null)

But maybe I'm being hasty here, because my display shows only 16 bytes from the beginning and the end! This means that on the left side our optimized program hasn't been running at all.

So our filter is supposed to fill the first 16 bytes so that the actual program can start there like this:

pre-fill: AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA scanline+k: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 scanline+j: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

But then we actually start the optimized routine like this:

pre-fill: AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA scanline+k: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 scanline+j: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Oops. Here's the fixed version.

make_new_filter = (optimized, ordinary):
    return (prior, scanline, input, stride, offset, length):
        if length < 32 # if it's so small chunk that our thing breaks
            return ordinary(prior, scanline, input, stride, offset, length)
        # max(0, offset - stride) + 16 must be filled.
        fill_offset = max(max(0, offset-stride) + 16, offset)
        fast_offset = max(offset, stride)
        fast_index = fast_offset - offset
        remain_length = length - fast_index
        last_length = remain_length & 15
        fast_length = remain_length - last_length
        last_index = length - last_length
        last_offset = offset + last_index
        ordinary(prior, scanline, input, stride, offset, fill_offset - offset)
        optimized(prior, scanline, input[fast_index .:], stride, fast_offset, fast_length)
        ordinary(prior, scanline, input[last_index .:], stride, last_offset, last_length)

It still gives the wrong results, but they are more apparent now. Next we use that trap instruction I showed and run the following in GDB:

display /x $xmm0.uint128
display /x $xmm1.uint128
display /x $xmm2.uint128
display /x $xmm3.uint128
display /x $xmm4.uint128
display /x $xmm5.uint128
display /x $xmm6.uint128
display /x $xmm7.uint128
display /x $xmm8.uint128
display /x $xmm9.uint128
display /x $xmm10.uint128
display /x $xmm11.uint128
display /x $xmm12.uint128
display /x $xmm13.uint128
display /x $xmm14.uint128
display /x $xmm15.uint128
layout asm

After the MOVDQU instructions, xmm0-xmm3 look like this:

1: /x $xmm0.uint128 = 0x00fefefe00ffffff00ffffff00ffffff
2: /x $xmm1.uint128 = 0xff7d7e7aff81827eff7e7f7bff81827e
3: /x $xmm2.uint128 = 0xff7d7e7aff7e7f7bff82837fff7f807c
4: /x $xmm3.uint128 = 0xff7e7f7bff82837fff7f807cff82837f

The xmm0 looks very familiar. I think it'll be apparent very soon why this code produces the wrong results.

After the punpck{lbw,hbw} have run, the register dump looks like the following:

1: /x $xmm0.uint128 = 0x00fefefe00ffffff00ffffff00ffffff
2: /x $xmm1.uint128 = 0xffff7e7e7f7f7b7bffff818182827e7e
3: /x $xmm2.uint128 = 0xffff828283837f7fffff7f7f80807c7c
4: /x $xmm3.uint128 = 0xffff7f7f80807c7cffff828283837f7f
5: /x $xmm4.uint128 = 0xff007d007e007a00ff00810082007e00
6: /x $xmm5.uint128 = 0xff007d007e007a00ff007e007f007b00
7: /x $xmm6.uint128 = 0xff007e007f007b00ff00820083007f00

It is suddenly painfully apparent that the destination and source register in the unpack operation cannot be the same. So we have to fix those lines like this:

# ah = 1817 PXOR ah
# bh = 1817 PXOR bh
# ch = 1817 PXOR ch
# al = 1817 PXOR al
# bl = 1817 PXOR bl
# cl = 1817 PXOR cl
emit(1817, al, al)
emit(1817, bl, bl)
emit(1817, cl, cl)
emit(1817, ah, ah)
emit(1817, bh, bh)
emit(1817, ch, ch)
# ah = 1772 PUNPCKHBW a
# bh = 1772 PUNPCKHBW b
# ch = 1772 PUNPCKHBW c
emit(1772, ah, pal)
emit(1772, bh, pbl)
emit(1772, ch, pcl)
# al = 1786 PUNPCKLBW a
# bl = 1786 PUNPCKLBW b
# cl = 1786 PUNPCKLBW c
emit(1786, al, pal)
emit(1786, bl, pbl)
emit(1786, cl, pcl)

Another iteration of the first entry with a debugger reveals that we calculate a + b - a rather than a + b - c like we should. This fixes that:

# pal -= 1742 PSUBD cl
# pah -= 1742 PSUBD ch
emit(1742, pal, cl)
emit(1742, pah, ch)

It seems that at the end we end up adding zero, so we are actually doing nothing in this program. The only reasonable explanation is that we mess up the conditionals somehow.

After the end of the conditional block, the result looks like this:

1: /x $xmm0.uint128 = 0x00fefefe00ffffff00ffffff00ffffff (result)
2: /x $xmm1.uint128 = 0xff007e007f007b00ff00810082007e00 (al)
3: /x $xmm2.uint128 = 0xff00820083007f00ff007f0080007c00 (bl)
4: /x $xmm3.uint128 = 0xff007f0080007c00ff00820083007f00 (cl)
5: /x $xmm4.uint128 = 0xff007d007e007a00ff00810082007e00 (ah)
6: /x $xmm5.uint128 = 0xff007d007e007a00ff007e007f007b00 (bh)
7: /x $xmm6.uint128 = 0xff007e007f007b00ff00820083007f00 (ch)
8: /x $xmm7.uint128 = 0xffffffffffffffffffffffffffffffff (c2h)
10: /x $xmm8.uint128 = 0x00000100010001000000010001000100 (pbl)
11: /x $xmm9.uint128 = 0xffffffffffffffffffffffffffffffff (c4l)
12: /x $xmm10.uint128 = 0x00000100010001000000040004000400 (pah)
13: /x $xmm11.uint128 = 0x00000100010001000000010001000100 (pbh)
14: /x $xmm12.uint128 = 0xffffffffffffffffffffffffffffffff (c4h)
15: /x $xmm13.uint128 = 0x00000000000000000000000000000000 (c3l)
16: /x $xmm14.uint128 = 0x00000000000000000000000000000000 (c3h)
17: /x $xmm15.uint128 = 0xffffffffffffffffffffffffffffffff (c2l)

The labels reveal that we have the following state:

al, bl, cl, ah, bh, ch
c3 = false
c4 = true

This should mean that the result b is selected. The following instructions should correctly preserve the result.

# bl = 1427 PAND bl c4l
# bh = 1427 PAND bh c4h
emit(1427, bl, c4l)
emit(1427, bh, c4h)

They did so. Respectively these next ones should zero the cl and ch.

# cl = 1431 PANDN cl, c4l
# ch = 1431 PANDN ch, c4h
emit(1431, cl, c4l)
emit(1431, ch, c4h)

I think I messed up something again. Reading up on this reveals that PANDN inverts the destination and then does the AND. It's not exactly what we expected. We have to flip the operands over and merge with the c3 and c4.
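A quick Python model of the x86 PANDN semantics makes the surprise explicit (the `pandn` helper is illustrative, not a real API):

```python
def pandn(dst, src, width=128):
    # x86 PANDN: dst = NOT(dst) AND src -- it is the *destination*
    # operand that gets inverted, not the source.
    mask = (1 << width) - 1
    return (~dst & mask) & src
```

So `PANDN cl, c4l` computes `~cl & c4l`, clobbering the value we wanted to keep, instead of the `cl & ~c4l` the select trick needs.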

The result is still completely wrong:

7f 83 82 ff 7c 80 7f ff 7f 83 82 ff 7b 7f 7e ff ... 48 52 2f ff 4f 59 36 ff 50 5a 37 ff 4e 58 35 ff
7e 82 81 ff fe ff fe 00 fe ff ff 00 fe fe fe 00 ... 06 06 06 00 0d 0d 0d 00 0d 0d 0d 00 08 08 08 00

One of the culprits could be PCMPGTD. We use it like this:

c1 = PCMPGTD pb pa

If this actually means 'greater than', then the code is:

c1 = pb > pa

And the condition was:

if pa <= pb and pa <= pc
    return a
elif pb <= pc
    return b
else
    return c

So there we have our problem. I had to think about this a bit and transform it into the correct form.

if not (pa > pb or pa > pc)
    return a
elif not pb > pc
    return b
else
    return c
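The two forms are logically equivalent (De Morgan: `pa <= pb and pa <= pc` is `not (pa > pb or pa > pc)`), which a brute-force check confirms (illustrative Python):

```python
def paeth_le(pa, pb, pc, a, b, c):
    # original form, written with <=
    if pa <= pb and pa <= pc: return a
    elif pb <= pc: return b
    return c

def paeth_gt(pa, pb, pc, a, b, c):
    # transformed form, using only > as PCMPGTD provides
    if not (pa > pb or pa > pc): return a
    elif not pb > pc: return b
    return c

# exhaustive check over a small range of distances
mismatches = [
    (pa, pb, pc)
    for pa in range(5) for pb in range(5) for pc in range(5)
    if paeth_le(pa, pb, pc, 1, 2, 3) != paeth_gt(pa, pb, pc, 1, 2, 3)
]
```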

Even then the code is not correct yet. Let's look at the input next. It looks weird because it's printed upside down: GDB shows the most significant byte first. Let's label the registers and compare them with what we know.

input:    /x $xmm0.uint128 = 0x 00 fe fe fe 00 ff ff ff 00 ff ff ff 00 ff ff ff
a:        /x $xmm7.uint128 = 0x ff 7d 7e 7a ff 81 82 7e ff 7e 7f 7b ff 81 82 7e
b:        /x $xmm8.uint128 = 0x ff 7d 7e 7a ff 7e 7f 7b ff 82 83 7f ff 7f 80 7c
c:        /x $xmm9.uint128 = 0x ff 7e 7f 7b ff 82 83 7f ff 7f 80 7c ff 82 83 7f

prior:    7f 83 82 ff 7c 80 7f ff 7f 83 82 ff 7b 7f 7e ff ... 48 52 2f ff 4f 59 36 ff 50 5a 37 ff 4e 58 35 ff
scanline: 7e 82 81 ff 7b 7f 7e ff 7e 82 81 ff 7a 7e 7d ff ... 50 5a 37 ff 57 61 3e ff 57 61 3e ff 52 5c 39 ff

This is precise. Next the data is packed into the 6 registers. We see them being cleaned:

al: /x $xmm1.uint128 = 0x00000000000000000000000000000000
bl: /x $xmm2.uint128 = 0x00000000000000000000000000000000
cl: /x $xmm3.uint128 = 0x00000000000000000000000000000000
ah: /x $xmm4.uint128 = 0x00000000000000000000000000000000
bh: /x $xmm5.uint128 = 0x00000000000000000000000000000000
ch: /x $xmm6.uint128 = 0x00000000000000000000000000000000

Then they get filled up with data:

al: /x $xmm1.uint128 = 0xff00 7e00 7f00 7b00 ff00 8100 8200 7e00
bl: /x $xmm2.uint128 = 0xff00 8200 8300 7f00 ff00 7f00 8000 7c00
cl: /x $xmm3.uint128 = 0xff00 7f00 8000 7c00 ff00 8200 8300 7f00
ah: /x $xmm4.uint128 = 0xff00 7d00 7e00 7a00 ff00 8100 8200 7e00
bh: /x $xmm5.uint128 = 0xff00 7d00 7e00 7a00 ff00 7e00 7f00 7b00
ch: /x $xmm6.uint128 = 0xff00 7e00 7f00 7b00 ff00 8200 8300 7f00

So the problem appears to be that we do the wrong kind of unpacking operation here. We can fix it with PSRLW, which shifts the packed words right.

The beginning of the sequence started to look right at this point, but the rest was still wrong. I started to realise there is something fairly wrong here.

Ok. So we fill our first run like this:

7e 82 81 ff .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..

Then we jump ahead by 16 bytes and we start reading the input from here:

7e 82 81 ff .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
            ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^

You know what? At this point I realised I cannot do it this way. Using this parallelization technique I can only compute the contents of one pixel in parallel, because each pixel depends on the previously decoded bytes of the same scanline. And it's messy considering what it does.
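The dependency is easy to demonstrate: with a 4-byte stride, a 16-byte chunk reads scanline bytes that the very same chunk is still producing (illustrative Python, not the decoder itself):

```python
chunk, stride = 16, 4   # 4 bytes per RGBA pixel, 16-byte SIMD chunk

# bytes this chunk writes into the scanline
writes = set(range(chunk))
# bytes the 'a' operand (scanline[j - stride]) reads for the same chunk
reads = set(j - stride for j in range(chunk))

overlap = reads & writes  # nonempty whenever stride < chunk
```

Only the first `stride` bytes of the chunk read already-finished output; the remaining `chunk - stride` lanes read stale data, which is why only the lanes within one pixel can truly run in parallel.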

I found yet more things I did wrong, such as using doubleword operations where I should have used word operations. After fixing those I started getting results that seemed plausible, but compared to the correct output they were still incorrect.

Let's leave it on the backburner

This was the first time when I've studied the SIMD instructions in detail. The outcome was a complete mess but I learned so much in the process that I'm not even frustrated.

The techniques presented here obviously work, but I need to write a compiler to make the machine code solution more practical as a choice.