Optimizing Lever PNG loading
The following code took over 9.4 seconds to run:
import png
for i in range(10)
    image = png.read_file(dir ++ "frants_boe_villender_test.png")
The image to load was a 640x480 RGBA image from a painting of Frants Diderik Bøe, who specialized in still life and landscapes.
The program became slow when I added the scanline filtering, so I would expect to know what slows it down. But I am really not in the mood for guessing wrong, so let's run the code through a profiler like this:
import png, fs, vmprof
vmprof_fd = fs.open(dir ++ "png.vmprof",
    fs.WRONLY | fs.TRUNC | fs.CREAT)
vmprof.enable(vmprof_fd, 0.0001)
for i in range(10)
    image = png.read_file(dir ++ "frants_boe_villender_test.png")
vmprof.disable()
vmprof_fd.close()
The profiler output:
100.0% L3_16 100.0% png_load.lc:3 (line 5 at this post)
100.0% .. L5_39 100.0% png.lc:5
99.6% .... L53_82 99.6% png.lc:53
0.1% ...... L87_90 0.1% png.lc:87
0.1% ...... L90_96 0.1% png.lc:90
49.4% ...... L99_101 49.6% png.lc:99
35.4% ........ L104_117 71.6% png.lc:104
16.9% .......... L117_131 47.9% png.lc:117
0.6% ...... L96_99 0.6% png.lc:96
0.3% .... L142_159 0.3% png.lc:142
0.1% ...... L159_161 37.5% png.lc:159
At png.lc:53 we have the code that runs the decoding loop:
read = (self, data):
    data = self.z.decompress(data)
    for byte in data
        if self.new_scanline
            self.filter = decode_filters[byte]
            self.new_scanline = false
            continue
        if self.l == 0
            u = 0
            c = 0
            if self.x - self.stride >= self.l
                l = self.data[self.x - self.stride]
            else
                l = 0
        else
            u = self.data[self.x - self.y_stride]
            if self.x - self.stride >= self.l
                l = self.data[self.x - self.stride]
                c = self.data[self.x - self.y_stride - self.stride]
            else
                l = 0
                c = 0
        self.data[self.x] = self.filter(byte, l, u, c)
        self.x += 1
        if self.x >= self.r
            self.l += self.y_stride
            self.r += self.y_stride
            self.new_scanline = true
It tells us that almost half of the time is spent at png.lc:99.
decode_filters = {
    4: (byte, l, u, c):
        return byte + paeth_predictor(l, u, c)
}
paeth_predictor = (a, b, c):
    p = a + b - c
    pa = abs_int(p - a)
    pb = abs_int(p - b)
    pc = abs_int(p - c)
    if pa <= pb and pa <= pc
        return a
    elif pb <= pc
        return b
    else
        return c

# TODO: make them into multimethods
abs_int = (a):
    if a < 0
        return -a
    return a
Line 60 is the paeth_predictor, and line 73 is the abs_int.
Motive
There are many C libraries for loading PNG images that I could have integrated into my runtime.
Lever is a gradually designed programming language. Observations made about the behavior of the language are used to improve it further. The intention is to leave in the possibility of changing the language, so that it can be refined over time to accommodate new discoveries.
One of my objectives with Lever is to make it into a high-level programming language that obtains good runtime performance on its own, and then supply it with utilities to optimize those implementations to match the performance obtainable with the C language.
To get anywhere with this, we need to start with something simple, and preferably something practical. Therefore I decided to write my own PNG implementation in Lever.
First the abs multimethod
The abs_int is a crutch that I made because Lever's abs function only operates on floats. The limitation isn't very elegant though. I've intended to remove it when it becomes more relevant.
When you're optimizing code, it is preferable to start with a version that doesn't have any crutches originating from unimplemented features in the programming language's runtime, so the first obvious thing here is to remove the abs_int and replace it with a properly implemented abs multimethod.
The implementation of abs is in runtime/vectormath.py, as pointed out by the documentation for abs(). We replace it with the multimethod implementation:
abs_ = Multimethod(1)

@abs_.multimethod_s(Float)
def abs_float(f):
    return Float(-f.number) if f.number < 0.0 else f

@abs_.multimethod_s(Integer)
def abs_int(i):
    return Integer(-i.value) if i.value < 0 else i
The Lever runtime needs to be retranslated for the changes to take effect. We can then remove the abs_int and replace it with abs in the paeth_predictor.
Meanwhile we have time to think about the paeth predictor. When you look at the variable flow, every math operation appears to require the result of the previous one. This is an important observation because we are running on a superscalar processor.
We are still somewhere on the dark side of the moon when it comes to instruction-level optimizations, though. When I run the program again, it now takes 8 seconds to run.
100.0% L3_16 100.0% png_load.lc:3 (line 5)
100.0% .. L5_39 100.0% png.lc:5
99.4% .... L53_82 99.4% png.lc:53 (line 27)
0.9% ...... L96_99 0.9% png.lc:96
0.2% ...... L90_96 0.2% png.lc:90
32.4% ...... L99_101 32.6% png.lc:99 (line 56)
15.3% ........ L104_125 47.1% png.lc:104 (line 60)
0.1% ...... L87_90 0.1% png.lc:87
0.4% .... L136_153 0.4% png.lc:136
0.3% ...... L153_155 62.5% png.lc:153
0.1% .... L125_136 0.1% png.lc:125
0.1% ...... L153_155 100.0% png.lc:153
So in total we shaved 17% with that small change. But it is still the paeth predictor and the PNG decoding loop that are slowing us down.
Checking the JIT traces
The JIT trace is perhaps showing us something interesting. The following command extracts it for us and places it into the pnglog:
PYPYLOG=jit-log-opt:pnglog lever samples/png/png_load.lc
This is the first time I am using the JIT trace to optimize a program, so my function for the printable location only returned the program counter:
debug_merge_point(0, 0, 'pc=25 setloc module=<module png>')
debug_merge_point(0, 0, 'pc=29 getglob module=<module png>')
debug_merge_point(0, 0, 'pc=32 getloc module=<module png>')
debug_merge_point(0, 0, 'pc=35 getloc module=<module png>')
debug_merge_point(0, 0, 'pc=38 getglob module=<module png>')
debug_merge_point(0, 0, 'pc=41 call module=<module png>')
+1119: i119 = int_sub(i118, i110)
I adjusted it to provide line numbers too. I'm sure it will need something better later on:
debug_merge_point(0, 0, 'pc=25 setloc module=<module png>:105')
debug_merge_point(0, 0, 'pc=29 getglob module=<module png>:105')
debug_merge_point(0, 0, 'pc=32 getloc module=<module png>:106')
debug_merge_point(0, 0, 'pc=35 getloc module=<module png>:106')
debug_merge_point(0, 0, 'pc=38 getglob module=<module png>:106')
debug_merge_point(0, 0, 'pc=41 call module=<module png>:106')
+1119: i119 = int_sub(i118, i110)
After dropping out lots of debug_merge_points, I see that the paeth predictor ends up like this:
+916: guard_nonnull_class(p3, ConstClass(Integer), descr=<Guard0x7effcc4a9db0>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+941: guard_nonnull_class(p5, ConstClass(Integer), descr=<Guard0x7effcc4a9e08>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+966: guard_not_invalidated(descr=<Guard0x7effcc475e20>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+980: p108 = getfield_gc_r(ConstPtr(ptr107), descr=<FieldP space.multimethod.Multimethod.inst_version 48>)
+991: guard_value(p108, ConstPtr(ptr109), descr=<Guard0x7effcc475e60>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+1000: i110 = getfield_gc_i(p3, descr=<FieldS space.numbers.Integer.inst_value 8 pure>)
+1011: i111 = getfield_gc_i(p5, descr=<FieldS space.numbers.Integer.inst_value 8 pure>)
+1022: i112 = int_add(i110, i111)
+1032: guard_nonnull_class(p7, ConstClass(Integer), descr=<Guard0x7effcc4a9e60>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+1071: p115 = getfield_gc_r(ConstPtr(ptr114), descr=<FieldP space.multimethod.Multimethod.inst_version 48>)
+1089: guard_value(p115, ConstPtr(ptr116), descr=<Guard0x7effcc475ea0>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+1098: i117 = getfield_gc_i(p7, descr=<FieldS space.numbers.Integer.inst_value 8 pure>)
+1109: i118 = int_sub(i112, i117)
+1119: i119 = int_sub(i118, i110)
+1147: p121 = getfield_gc_r(ConstPtr(ptr120), descr=<FieldP space.multimethod.Multimethod.inst_version 48>)
+1165: guard_value(p121, ConstPtr(ptr122), descr=<Guard0x7effcc475ee0>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+1174: i124 = int_lt(i119, 0)
+1182: guard_false(i124, descr=<Guard0x7effcc475f20>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+1188: i125 = int_sub(i118, i111)
+1198: i127 = int_lt(i125, 0)
+1202: guard_false(i127, descr=<Guard0x7effcc475f60>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+1208: i128 = int_sub(i118, i117)
+1218: i130 = int_lt(i128, 0)
+1222: guard_false(i130, descr=<Guard0x7effcc475fa0>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+1235: p132 = getfield_gc_r(ConstPtr(ptr131), descr=<FieldP space.multimethod.Multimethod.inst_version 48>)
+1246: guard_value(p132, ConstPtr(ptr133), descr=<Guard0x7effcc4c0020>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+1255: i134 = int_le(i119, i125)
+1262: guard_true(i134, descr=<Guard0x7effcc4c0060>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+1268: i135 = int_le(i119, i128)
+1275: guard_true(i135, descr=<Guard0x7effcc4c00a0>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+1281: leave_portal_frame(0)
Superfluous guards
I even found something I should no longer have there:
+980: p108 = getfield_gc_r(ConstPtr(ptr107), descr=<FieldP space.multimethod.Multimethod.inst_version 48>)
+991: guard_value(p108, ConstPtr(ptr109), descr=<Guard0x7effcc475e60>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+1071: p115 = getfield_gc_r(ConstPtr(ptr114), descr=<FieldP space.multimethod.Multimethod.inst_version 48>)
+1089: guard_value(p115, ConstPtr(ptr116), descr=<Guard0x7effcc475ea0>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+1147: p121 = getfield_gc_r(ConstPtr(ptr120), descr=<FieldP space.multimethod.Multimethod.inst_version 48>)
+1165: guard_value(p121, ConstPtr(ptr122), descr=<Guard0x7effcc475ee0>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+1235: p132 = getfield_gc_r(ConstPtr(ptr131), descr=<FieldP space.multimethod.Multimethod.inst_version 48>)
+1246: guard_value(p132, ConstPtr(ptr133), descr=<Guard0x7effcc4c0020>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
Every guard in the JIT converts into a comparison and a conditional branch. The guards ensure that the program diverges from the fast path if it's not applicable. Making the version number pseudoimmutable should help a bit here:
class Multimethod(Object):
    _immutable_fields_ = ['arity', 'multimethod_table', 'interface_table',
                          'default?', 'version?']
Pseudoimmutability is declared when we expect that a value does not change often. The version number changes when the multimethod is updated, but from the perspective of hot loops it never changes.
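To make the effect concrete, here is a rough sketch of the usual version-tag pattern behind such a pseudoimmutable field; the VersionTag class, the methods dict and the register method are illustrative stand-ins, not the actual Lever runtime code. Reads of the version constant-fold inside a trace, and the rare write swaps the tag so old traces get invalidated.

# Sketch of the version-tag pattern (illustrative names, not the Lever runtime).
class Object(object):            # stand-in for the runtime's base object class
    pass

class VersionTag(object):
    pass

class Multimethod(Object):
    _immutable_fields_ = ['arity', 'version?']   # '?' marks the quasi-immutable field

    def __init__(self, arity):
        self.arity = arity
        self.version = VersionTag()
        self.methods = {}

    def register(self, types, fn):
        self.methods[types] = fn
        self.version = VersionTag()  # rare write: invalidates traces that folded the old tag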
We still have lots of guards here.
+820: guard_nonnull_class(p3, ConstClass(Integer), descr=<Guard0x7f9c21d1a1d8>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+845: guard_nonnull_class(p5, ConstClass(Integer), descr=<Guard0x7f9c21d1a230>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+870: guard_not_invalidated(descr=<Guard0x7f9c21ccbf60>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+870: i107 = getfield_gc_i(p3, descr=<FieldS space.numbers.Integer.inst_value 8 pure>)
+888: i108 = getfield_gc_i(p5, descr=<FieldS space.numbers.Integer.inst_value 8 pure>)
+892: i109 = int_add(i107, i108)
+902: guard_nonnull_class(p7, ConstClass(Integer), descr=<Guard0x7f9c21d1a288>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+935: i111 = getfield_gc_i(p7, descr=<FieldS space.numbers.Integer.inst_value 8 pure>)
+939: i112 = int_sub(i109, i111)
+956: i113 = int_sub(i112, i107)
+970: i115 = int_lt(i113, 0)
+974: guard_false(i115, descr=<Guard0x7f9c21ccbfa0>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+980: i116 = int_sub(i112, i108)
+997: i118 = int_lt(i116, 0)
+1001: guard_false(i118, descr=<Guard0x7f9c21d20020>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+1007: i119 = int_sub(i112, i111)
+1017: i121 = int_lt(i119, 0)
+1021: guard_false(i121, descr=<Guard0x7f9c21d20060>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+1027: i122 = int_le(i113, i116)
+1034: guard_true(i122, descr=<Guard0x7f9c21d200a0>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+1040: i123 = int_le(i113, i119)
+1047: guard_true(i123, descr=<Guard0x7f9c21d200e0>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+1053: leave_portal_frame(0)
And we are down to 7.1 seconds:
100.0% L3_16 100.0% png_load.lc:3 (line 5)
100.0% .. L5_39 100.0% png.lc:5
99.5% .... L53_82 99.5% png.lc:53 (line 27)
0.7% ...... L96_99 0.7% png.lc:96
0.1% ...... L87_90 0.1% png.lc:87
29.9% ...... L99_101 30.0% png.lc:99 (line 56)
12.3% ........ L104_128 41.2% png.lc:104 (line 60)
0.1% ...... L90_96 0.1% png.lc:90
0.2% .... L139_156 0.2% png.lc:139
0.1% ...... L156_158 25.0% png.lc:156
0.1% .... L40_53 0.1% png.lc:40
Somewhat illogically, the time spent in the paeth predictor increased!
Conditional branching
Another trick to try would be to remove the unnecessary branching by coalescing the conditionals. For that we need a tool; a select() function does fine.
@builtin
@signature(Boolean, Object, Object)
def select(cond, a, b):
    return a if cond == true else b
The optimizer will likely convert this into a conditional move, or so I hope. In any case it should eliminate additional branches that would turn into guards. The paeth_predictor looks like this now:
paeth_predictor = (a, b, c):
    p = a + b - c
    pa = abs(p - a)
    pb = abs(p - b)
    pc = abs(p - c)
    return select(pa <= pb and pa <= pc,
        a, select(pb <= pc, b, c))
The results were perhaps illogical. This was not any faster. When I looked into the trace, the guards were still there. Maybe our implementation has the wrong shape? Let's do it in a way that cannot be misinterpreted:
@builtin
@signature(Boolean, Object, Object)
def select(cond, a, b):
    return [b, a][cond == true]
Even then it does not have the desired effect. Forming a new bytecode instruction that does this shouldn't have any effect either, because the call is constant-folded away anyway.
Reader loop
One problem in our reader loop is that we are doing a whole lot of work inside it. We are checking bounds and fetching values for the filter:
data = self.z.decompress(data)
for byte in data
    if self.new_scanline
        self.filter = decode_filters[byte]
        self.new_scanline = false
        continue
    if self.l == 0
        u = 0
        c = 0
        if self.x - self.stride >= self.l
            l = self.data[self.x - self.stride]
        else
            l = 0
    else
        u = self.data[self.x - self.y_stride]
        if self.x - self.stride >= self.l
            l = self.data[self.x - self.stride]
            c = self.data[self.x - self.y_stride - self.stride]
        else
            l = 0
            c = 0
    # run the filter and update the bounds
    self.data[self.x] = self.filter(byte, l, u, c)
    self.x += 1
    if self.x >= self.r
        self.l += self.y_stride
        self.r += self.y_stride
        self.new_scanline = true
We are doing all of that for every one of the 1 228 800 bytes (640 × 480 × 4), yet we only have 480 scanlines!
We could instead do:
prior = self.prior
scanline = self.scanline
index = self.index
data = self.z.decompress(data)
while data.length > 0
    if index == 0
        self.filter = decode_filters[data[0]]
        data = data[1 .:]
    L = min(self.y_stride - index, data.length)
    for i in range(index, index+L)
        if i < self.stride
            l = 0
            c = 0
        else
            l = scanline[i - self.stride]
            c = prior[i - self.stride]
        u = prior[i]
        scanline[i] = data[i-index] + self.filter(l, u, c)
    if index + L >= self.y_stride
        prior = scanline
        scanline = self.data[self.next_scanline .: self.next_scanline + self.y_stride]
        index = 0
        self.next_scanline += self.y_stride
    else
        index += L
    data = data[L .:]
    assert index < self.y_stride
self.prior = prior
self.scanline = scanline
self.index = index
To always have an empty prior scanline, we initialize it to point at the second scanline. I also flip the addition out of the filter, which helps a little bit.
It now takes 3.6 seconds to load those 10 images.
100.0% L3_17 100.0% png_load.lc:3 (line 5)
100.0% .. L5_39 100.0% png.lc:5
99.2% .... L53_85 99.2% png.lc:53 (line 231)
0.2% ...... L90_91 0.2% png.lc:90
1.5% ...... L96_97 1.5% png.lc:96
27.2% ...... L99_110 27.4% png.lc:99 (line 192)
0.5% ...... L93_94 0.5% png.lc:93
0.3% .... L133_150 0.3% png.lc:133
ImageMagick as a comparison
As a comparison, if I copy the image 10 times and then run:
time convert *.png converted.jpg
It shows me that it takes 0.225s to do the job, and it's not only reading the PNG but also converting it into a JPEG. We are at least 10 to 20 times slower than the C program, which is a bit unacceptable.
Optimization of method lookup
We looked at the trace again on the pypy channel and found yet another lookup_method that was not elided. The likely reason was that the method table was not perceived as immutable. This resulted in the following silly pair of guards:
+600: p172 = getfield_gc_r(p1, descr=<FieldP space.customobject.CustomObject.inst_custom_interface 8 pure>)
+618: p175 = call_r(ConstClass(Interface.lookup_method), p172, ConstPtr(ptr174), descr=<Callr 8 rr EF=4>)
+710: guard_no_exception(descr=<Guard0x7f7b2d3a7d00>) [p0, p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, p12, p13, p14, p15, p16, p17, p18, p19, p20, p21, p22, p23, p24, p25, p26, p27, p28, p29, p30, p31, p32, p33, p34, p35, p36, p37, p38, p39, p40, p41, p42, p43, p44, p45, p46, p47, p48, p49, p50, p51, p52, p53, p54, p55, p56, p57, p58, p59, p60, p61, p62, p63, p64, p65, p66, p67, p68, p69, p70, p71, p72, p73, p74, p75, p76, p77, p78, p79, p80, p81, p82, p83, p84, p85, p86, p87, p88, p89, p90, p91, p92, p93, p94, p95, p96, p97, p98, p99, p100, p101, p102, p103, p104, p105, p106, p107, p114, p115, p116, p117, p118, p119, p120, p121, p122, p123, p124, p125, p126, p127, p128, p129, p130, p131, p132, p133, p134, p135, p136, p137, p138, p139, p140, p141, p142, p143, p144, p145, p146, p147, p148, p149, p150, p151, p152, p153, p154, p155, p156, p157, p158, p159, p160, p161, p162, p175, i171, i166]
+725: guard_isnull(p175, descr=<Guard0x7f7b2d3ac020>) [p0, p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, p12, p13, p14, p15, p16, p17, p18, p19, p20, p21, p22, p23, p24, p25, p26, p27, p28, p29, p30, p31, p32, p33, p34, p35, p36, p37, p38, p39, p40, p41, p42, p43, p44, p45, p46, p47, p48, p49, p50, p51, p52, p53, p54, p55, p56, p57, p58, p59, p60, p61, p62, p63, p64, p65, p66, p67, p68, p69, p70, p71, p72, p73, p74, p75, p76, p77, p78, p79, p80, p81, p82, p83, p84, p85, p86, p87, p88, p89, p90, p91, p92, p93, p94, p95, p96, p97, p98, p99, p100, p101, p102, p103, p104, p105, p106, p107, p114, p115, p116, p117, p118, p119, p120, p121, p122, p123, p124, p125, p126, p127, p128, p129, p130, p131, p132, p133, p134, p135, p136, p137, p138, p139, p140, p141, p142, p143, p144, p145, p146, p147, p148, p149, p150, p151, p152, p153, p154, p155, p156, p157, p158, p159, p160, p161, p162, p175, i171, i166]
Giving the JIT a hint to promote the custom_interface into a constant solves this problem. It means that the lookup_method can then be elided, and the JIT can figure out that it never returns anything other than null, so instead it just retrieves the value from a storage object.
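As a rough sketch of the kind of hint involved: the RPython way to do this is jit.promote together with an elidable lookup. The call site below is illustrative, and the methods dict on Interface is an assumption, not the actual Lever runtime code.

# Sketch only: illustrative call site, assuming a 'methods' dict on Interface.
from rpython.rlib import jit

class Interface(object):
    _immutable_fields_ = ['methods']

    @jit.elidable
    def lookup_method(self, name):
        # Elidable: for a given (interface, name) pair the result never changes,
        # so the JIT may fold the whole call away.
        return self.methods.get(name, None)

def get_method(obj, name):
    interface = jit.promote(obj.custom_interface)  # treat as a constant in the trace
    return interface.lookup_method(name)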
After the improvement the program has shaved down to 2.7 seconds and the paeth filter appears to start becoming a bottleneck again.
100.0% L3_17 100.0% png_load.lc:3 (line 5)
100.0% .. L5_39 100.0% png.lc:5
98.7% .... L53_86 98.7% png.lc:53 (line 231)
0.1% ...... L91_92 0.1% png.lc:91
1.9% ...... L97_98 1.9% png.lc:97
31.8% ...... L100_111 32.2% png.lc:100 (line 192)
0.1% ...... L94_95 0.1% png.lc:94
1.0% .... L134_151 1.0% png.lc:134
0.1% ...... L151_153 14.3% png.lc:151
An insight into the JIT
Finding the above problem perhaps gave me some intuition about the behavior of the metatracing JIT.
When installing a JIT driver, I have to tell it the color of certain variables:
jitdriver = jit.JitDriver(
    greens=['pc', 'block', 'module', 'unit', 'excs', 'function'], #, 'ec'],
    reds=['frame'], # 'regv',
    virtualizables = ['frame'],
    get_printable_location=get_printable_location,
    get_unique_id=get_unique_id, is_recursive=True)
The need to specify red variables is an implementation detail here, but the green variables are needed by the JIT. The greens tell which variables are supposed to stay constant during a trace.
When a trace meets a green variable, it puts a guard on it and treats the program as if that variable remained constant. This information allows the tracing JIT to do deduction and optimize the trace.
Variables that can be deduced to be constant, such as constants read from other constants, can be implicitly marked green without a guard. On the other hand, if a guard fails, the program needs to branch and has to trace again.
In a well-tuned JIT, methods that are not dynamically used are promoted into constants.
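A toy interpreter makes the coloring concrete. This is not the Lever interpreter, just the canonical shape of an RPython jit driver: the greens (pc, code) pin down the position in the user program, and the red (acc) carries the values being computed.

from rpython.rlib import jit

driver = jit.JitDriver(greens=['pc', 'code'], reds=['acc'])

def interpret(code):
    # code is a string of '+' and '-' opcodes; acc is the single red value.
    pc = 0
    acc = 0
    while pc < len(code):
        driver.jit_merge_point(pc=pc, code=code, acc=acc)
        op = code[pc]
        if op == '+':
            acc += 1
        elif op == '-':
            acc -= 1
        pc += 1
    return acc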
Shifting the filter call out of the loop
It turns out that we can still squeeze some juice out by restructuring the program. Let's take this loop inside the reader loop:
for i in range(index, index+L)
    if i < self.stride
        l = 0
        c = 0
    else
        l = scanline[i - self.stride]
        c = prior[i - self.stride]
    u = prior[i]
    scanline[i] = data[i-index] + self.filter(l, u, c)
And replace it with this:
self.filter(prior, scanline, data[.: L], self.stride, index)
Then we write several smaller loops like this:
# average
((prior, scanline, data, stride, offset):
    for i in range(offset, offset+data.length)
        u = prior[i]
        if i >= stride
            u = scanline[i - stride] + u
        scanline[i] = data[i - offset] + u // 2
),
This shaves the runtime down to 1.9 seconds. It means that one image takes about 200ms to load.
100.0% L3_17 100.0% png_load.lc:3 (line 5)
100.0% .. L5_39 100.0% png.lc:5
98.7% .... L53_78 98.7% png.lc:53 (line 231, with 289)
0.8% ...... L85_91 0.9% png.lc:85
4.4% ...... L98_104 4.5% png.lc:98
84.6% ...... L106_126 85.7% png.lc:106 (line 300)
0.8% ...... L93_96 0.9% png.lc:93
0.6% .... L149_166 0.6% png.lc:149
0.2% ...... L166_168 33.3% png.lc:166
The statistics show, perhaps reassuringly, that the paeth filter constitutes 80% of the runtime. If we somehow managed to shave it to a tenth of its time, the program would take only about 550ms to run the benchmark.
How much faster could it run?
Now there's an interesting question: If I optimized just the paeth-predictor loop, how fast could this program run?
If I temporarily disable the paeth loop, the app takes 0.3s to run. So if that part of the program didn't run at all, that would be the limit we cannot exceed by optimizing only that one function in the program.
If we managed to optimize all of the inner loops, the vmprofile would then start to look like this:
100.0% L3_17 100.0% png_load.lc:3 (line 5)
10.0% .. L30_41 10.0% png_load.lc:30
86.0% .. L5_39 86.0% png.lc:5
6.0% .... L154_171 7.0% png.lc:154
76.0% .... L53_78 88.4% png.lc:53 (line 231, with 289)
The assembler
Motivated by the fast paeth filter, I decided to pick up the google/CPU-instructions dataset and implement an assembler that uses those tables.
If you are ever building an assembler remember that it helps a whole lot if you have a machine-readable table that tells the encoding and operands for every instruction.
While writing my own I also noticed that the encoding task is easy if you break it into two stages. In the first stage you fill up the parameters such as:
- rex byte
- operand_op field (for operands that embed into the opcode fields)
- modrm byte
- sib byte
- displacement offset
- immediate values passed in
Doing this kind of scan through the operands makes it easier to determine the size of the displacement field or where the SIB byte belongs.
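As an illustration of the two stages for the very simplest case, a 64-bit register-to-register ADD, here is a small Python sketch; a table-driven encoder naturally also has to handle memory operands, SIB bytes, displacements and immediates.

def encode_add_r64_r64(dst, src):
    # Stage 1: fill up the parameters.
    rex_w = 1                    # 64-bit operand size
    rex_r = (src >> 3) & 1       # extends the ModRM.reg field
    rex_b = (dst >> 3) & 1       # extends the ModRM.rm field
    modrm_mod = 0b11             # register-direct addressing, no SIB or displacement
    modrm_reg = src & 0b111
    modrm_rm  = dst & 0b111
    # Stage 2: serialize the fields into bytes.
    rex   = 0x40 | (rex_w << 3) | (rex_r << 2) | rex_b
    modrm = (modrm_mod << 6) | (modrm_reg << 3) | modrm_rm
    return bytes([rex, 0x01, modrm])   # 0x01 = ADD r/m64, r64

# encode_add_r64_r64(3, 1) gives 48 01 cb, which is "add rbx, rcx".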
The CPU-instructions dataset also contains SSE and AVX instructions, and I expected they would require encoding of the VEX or EVEX prefix. I didn't do that right yet, but I am sure to return to the subject sometime soon. In any case, the proper use of those instructions requires feature detection code.
After looking at the mnemonics for a while, I decided not to use them and instead treat them as a form of documentation that describes what semantics the instruction follows.
Returns and branches that have far and near versions share the same mnemonic, so I found it a bit risky to just let a program select a certain instruction from the list and run with it. To work around this problem, I gave a unique identifier to every instruction in the list:
5931 XCHG AX r16 [64] [L] Exchange r16 with AX.
5932 XCHG EAX r32 [64] [L] Exchange r32 with EAX.
5933 XCHG RAX r64 [64] Exchange r64 with RAX.
5934 XCHG m16 r16 [64] [L] Exchange r16 with word from r/m16.
5935 XCHG m32 r32 [64] [L] Exchange r32 with doubleword from r/m32.
5936 XCHG m64 r64 [64] Exchange r64 with quadword from r/m64.
5937 XCHG m8 r8 [64] [L] Exchange r8 (byte register) with byte from r/m8.
5938 XCHG m8 r8 [64] Exchange r8 (byte register) with byte from r/m8.
5939 XCHG r16 AX [64] [L] Exchange AX with r16.
5940 XCHG r16 m16 [64] [L] Exchange word from r/m16 with r16.
5941 XCHG r16 r16 [64] [L] Exchange word from r/m16 with r16.
5942 XCHG r16 r16 [64] [L] Exchange r16 with word from r/m16.
5943 XCHG r32 EAX [64] [L] Exchange EAX with r32.
5944 XCHG r32 m32 [64] [L] Exchange doubleword from r/m32 with r32.
5945 XCHG r32 r32 [64] [L] Exchange doubleword from r/m32 with r32.
5946 XCHG r32 r32 [64] [L] Exchange r32 with doubleword from r/m32.
5947 XCHG r64 RAX [64] Exchange RAX with r64.
5948 XCHG r64 m64 [64] Exchange quadword from r/m64 with r64.
5949 XCHG r64 r64 [64] Exchange r64 with quadword from r/m64.
5950 XCHG r64 r64 [64] Exchange quadword from r/m64 with r64.
5951 XCHG r8 m8 [64] [L] Exchange byte from r/m8 with r8 (byte register).
5952 XCHG r8 m8 [64] Exchange byte from r/m8 with r8 (byte register).
5953 XCHG r8 r8 [64] [L] Exchange r8 (byte register) with byte from r/m8.
5954 XCHG r8 r8 [64] [L] Exchange byte from r/m8 with r8 (byte register).
5955 XCHG r8 r8 [64] Exchange r8 (byte register) with byte from r/m8.
5956 XCHG r8 r8 [64] Exchange byte from r/m8 with r8 (byte register).
Many of these are the same opcode but with different prefixes or flags. I don't have a method to calculate the size of these, and I do not have a heuristic for selecting them.
To represent addresses in the memory operands, I picked the following format:
Address(type, offset, base=-1, index=-1, scale=1)
You can have the address without the base register if the additional 4 bytes of offset are tolerable.
Now I can finally emit instructions like this:
# 1006 LEA r64 m [64] Store effective address for m in register r64.
emit(1006, Register(i64, 11), Address(i64, 0, 10, 13))
# ABS on R11
# 1127 MOV r64 r64 [64] Move r/m64 to r64.
emit(1127, Register(i64, 13), Register(i64, 11))
# 1289 NEG r64 [64] Two's complement negate r/m64.
emit(1289, Register(i64, 11))
# 335 CMOVL r64 m64 [64] Move if less (SF≠ OF).
emit( 335, Register(i64, 11), Register(i64, 13))
I linked the loops by running the assembler twice and using ordinary variables for them:
emit( 878, Immediate(i8, loop_label - jl1_label))
jl1_label := output.length
As long as the instruction numbers and lengths do not change, the second run reaches a fixed point.
The paeth decoder written in assembly like the above looks like this:
assert platform.arch == "x86_64"
"It just might not work on x86"
assert platform.name.startswith("linux")
"Also this depends on the SystemV calling conventions."
arg_0 = Register(i64, 7) # RDI
arg_1 = Register(i64, 6) # RSI
arg_2 = Register(i64, 2) # RDX
arg_3 = Register(i64, 1) # RCX
arg_4 = Register(i64, 8)
arg_5 = Register(i64, 9)
loop_label = 0
exit_label = 0
ja_label = 0
jl1_label = 0
jl2_label = 0
assemble = ():
# 1811 PUSH r64 [64] Push r64.
# we need additional register to clobber.
emit(1811, Register(i64, 3)) # x .r3
emit(1811, Register(i64, 10)) # .r10
emit(1811, Register(i64, 11)) # .r11
emit(1811, Register(i64, 12)) # .r12
emit(1811, Register(i64, 13)) # a .r13
emit(1811, Register(i64, 14)) # b .r14
emit(1811, Register(i64, 15)) # c .r15
# i = 0 .r0
# 5973 XOR m64 r64 [64] r/m64 XOR r64.
emit(5973, Register(i64, 0), Register(i64, 0))
emit(5973, Register(i64, 13), Register(i64, 13))
emit(5973, Register(i64, 15), Register(i64, 15))
# j = offset .arg_4(8)
# 1289 NEG r64 [64] Two's complement negate r/m64.
# 74 ADD r64 r64 [64] Add r64 to r/m64.
# k = j - stride .r1
emit(1289, Register(i64, 1))
emit( 74, Register(i64, 1), Register(i64, 8))
# 878 JL rel8 [64] [L] Jump short if less (SF≠ OF).
emit( 878, Immediate(i8, loop_label - jl1_label))
jl1_label := output.length
emit(1254, Register(i64, 13), Address(i8, 0, 7, 1)) # [prior .r7 + k .r1]
emit(1254, Register(i64, 15), Address(i8, 0, 6, 1)) # [scanline .r6 + k .r1]
# loop:
loop_label := output.length
# 0475 CMP m64 r64 [64] Compare r64 with r/m64.
# if i >= length goto exit
emit( 475, Register(i64, 0), arg_5)
# 0862 JAE rel32 [64] [L] Jump short if above or equal (CF=0).
emit( 862, Immediate(i32, exit_label - ja_label))
ja_label := output.length
## Test
# emit( 54, Address(i64, 0, arg_0.index), Immediate(i32, 1))
# 1254 MOVZX r64 m8 [64] Move byte to quadword, zero-extension.
# x = input[i]
emit(1254, Register(i64, 14), Address(i8, 0, 7, 8)) # [prior .r7 + j .r8]
# Starting paeth predictor here. a.r13, b.r14, c.r15
# 1006 LEA r64 m [64] Store effective address for m in register r64.
# 2327 SUB r64 m64 [64] Subtract r/m64 from r64.
#p = a + b - c p.r10
emit(1006, Register(i64, 10), Address(i64, 0, 13, 14))
emit(2327, Register(i64, 10), Register(i64, 15))
emit(1289, Register(i64, 10)) # negate to use LEA again.
# 335 CMOVL r64 m64 [64] Move if less (SF≠ OF).
# 1127 MOV r64 r64 [64] Move r/m64 to r64.
#pa = abs(a - p) p.r10 a.r13 pa.r11
emit(1127, Register(i64, 3), Register(i64, 13))
emit(1006, Register(i64, 11), Address(i64, 0, 10, 13)) # The a.r13 is free now.
# ABS on R11
emit(1127, Register(i64, 13), Register(i64, 11))
emit(1289, Register(i64, 11)) # negate to use CMOVL
emit( 335, Register(i64, 11), Register(i64, 13))
#pb = abs(b - p) p.r10 a.r13 pa.r12
emit(1006, Register(i64, 12), Address(i64, 0, 10, 14))
# ABS on R12
emit(1127, Register(i64, 13), Register(i64, 12))
emit(1289, Register(i64, 12)) # negate to use CMOVL
emit( 335, Register(i64, 12), Register(i64, 13))
# Now we want to compare and CMOVL if relevant.
emit( 475, Register(i64, 12), Register(i64, 11))
emit( 335, Register(i64, 11), Register(i64, 12))
emit( 335, Register(i64, 3), Register(i64, 14))
#pc = abs(c - p) p.r10 a.r13 pa.r11
emit(1006, Register(i64, 12), Address(i64, 0, 10, 15))
# ABS on R12
emit(1127, Register(i64, 13), Register(i64, 12))
emit(1289, Register(i64, 12)) # negate to use CMOVL
emit( 335, Register(i64, 12), Register(i64, 13))
# Now we want to compare and CMOVL if relevant.
emit( 475, Register(i64, 12), Register(i64, 11))
emit( 335, Register(i64, 11), Register(i64, 12))
emit( 335, Register(i64, 3), Register(i64, 15))
# Done with paeth, next we add.
#emit(1254, Register(i64, 3), Address(i8, 0, 2, 0)) # [input .r2 + i .r0]
# 78 ADD r8 m8 [64] [L] Add r/m8 to r8.
emit( 78, Register(i8, 3), Address(i8, 0, 2, 0)) # [input .r2 + i .r0]
# 1097 MOV m8 r8 [64] [L] Move r8 to r/m8.
# scanline[j] = x
emit(1097, Address(i8, 0, 6, 8), Register(i8, 3)) # [scanline .r6 + j .r8]
# 0072 ADD r64 imm8 [64] Add sign-extended imm8 to r/m64.
# i += 1
emit( 72, Register(i64, 0), Immediate(i8, 1))
# j += 1
emit( 72, Register(i64, 8), Immediate(i8, 1))
# clean a and c
emit(5973, Register(i64, 13), Register(i64, 13))
emit(5973, Register(i64, 15), Register(i64, 15))
# k += 1
emit( 72, Register(i64, 1), Immediate(i8, 1))
# 878 JL rel8 [64] [L] Jump short if less (SF≠ OF).
emit( 878, Immediate(i8, loop_label - jl2_label))
jl2_label := output.length
emit(1254, Register(i64, 13), Address(i8, 0, 7, 1)) # [prior .r7 + k .r1]
emit(1254, Register(i64, 15), Address(i8, 0, 6, 1)) # [scanline .r6 + k .r1]
# jump loop
# 0886 JMP rel32 [64] [L] Jump short, RIP = RIP + 8-bit displacement sign extended to 64-bits
emit( 886, Immediate(i32, loop_label - exit_label))
# exit:
exit_label := output.length
# Restore .r3 we clobbered.
# 1638 POP r64 [64] Pop top of stack into r64; increment stack pointer. Cannot encode 32-bit operand size.
emit(1638, Register(i64, 15))
emit(1638, Register(i64, 14))
emit(1638, Register(i64, 13))
emit(1638, Register(i64, 12))
emit(1638, Register(i64, 11))
emit(1638, Register(i64, 10))
emit(1638, Register(i64, 3))
# return
emit(1901) # RET NEAR
emit = (uid, args...):
output.extend(asm.encode_ins(uid, args))
output = []
assemble()
output = []
assemble()
At the point where I was going to write the actual algorithm that the loop was supposed to run, I noticed that the available registers had almost run out.
From here it's easy to run this in place of the original paeth filter. We copy the assembler output into an executable memory region and W^X it. I already have a utility for that:
buf = mman.Asmbuf(4096)
buf.memcpy(Uint8Array(output))
buf.finalize()
Next we need a function handle that we can call. We get one from the FFI:
c_type = ffi.cfunc(ffi.int, [
    ffi.voidp, ffi.voidp, ffi.voidp,      # prior, scanline, data
    ffi.size_t, ffi.size_t, ffi.size_t])  # stride, offset, length
c_func = ffi.cast(buf, c_type)
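For readers without the Lever utilities at hand, the same two steps (executable memory plus a function pointer) can be sketched in plain Python with mmap and ctypes. This sketch maps the page read-write-execute for brevity rather than doing a strict W^X flip, and assumes Linux on x86-64.

import ctypes, mmap

code = bytes([0x89, 0xf8, 0xc3])   # mov eax, edi ; ret  -- returns its first argument
buf = mmap.mmap(-1, mmap.PAGESIZE,
                prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC)
buf.write(code)
addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
func = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_int)(addr)
assert func(42) == 42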
The filter array requires a little rewrite so that the length field is exposed like above, but otherwise this already works, and we can hot-patch the assembler program into the png module:
import png
png.decode_filters[4] = c_func
main_png()
Now we can run the original benchmark.
Assembler results
With the monkey-patched assembly module, our original benchmark now takes 0.6 seconds to run, and the vmprofile report very closely resembles the one where the paeth decoder was removed:
100.0% L5_200 100.0% fast_paeth.lc:5
100.0% .. L203_217 100.0% fast_paeth.lc:203
6.6% .... L230_241 6.6% fast_paeth.lc:230
92.3% .... L5_39 92.3% png.lc:5
11.0% ...... L149_166 11.9% png.lc:149
3.3% ........ L166_168 30.0% png.lc:166
79.1% ...... L53_78 85.7% png.lc:53 (line 231, with 289)
1.1% ........ L93_96 1.4% png.lc:93
24.2% ........ L98_104 30.6% png.lc:98
4.4% ........ L85_91 5.6% png.lc:85
I did double check that the paeth decoder keeps working. I made a hexadecimal dump from the image and compared the data with the optimized version. Both of them look like this:
35 51 45 ff 36 46 3b ff 1b 1e 19 ff 18 1b 16 ff
33 4f 43 ff 27 3a 2f ff 1b 1e 19 ff 18 1b 16 ff
2c 4a 40 ff 2e 4c 42 ff 2f 4d 43 ff 34 4f 46 ff
27 45 3b ff 29 47 3d ff 2a 48 3e ff 28 48 3a ff
27 47 3c ff 26 44 3a ff 25 43 39 ff 23 43 39 ff
You can tell that this data is at least plausibly correct, because the full alpha channels separate the scanlines into columns of ff symbols.
Closer inspection shows that the assembled version has some errors. It still runs through all the pixels, so I'd believe the problem could be something subtle.
I won't use the assembled version quite yet, so I leave the fix for later iterations.
Finally I also made a Vulkan demo that uses the JIT-optimized png module. Here's the screenshot:
Hand-written assembly is not practical
The assembler optimization is very successful, considering that we don't even have to store any compiled executables here and we are still 4 times faster than running within the JIT.
The problem is that the assembly module here cannot adjust for different calling conventions. It only works on Linux and on x86_64. There are other problems:
- We have two versions to maintain. And if we discard the old version, then we cannot return to JIT-evaluating the code in case it is more practical for some possible use case.
- We had to rewrite the assembler version from scratch when trying to optimize the program that way.
- We cannot access all GC records from the assembly. Only the ones that we could naturally pass into C libraries.
- The produced assembly code is calling-convention dependent and fixed to an operating system.
The practical solution would be to run type inference on the original program and then compile it down to assembly code. That is the reason why I wrote the assembler in the first place, but it is also a subject for a second blog post.
Driving it through a GLSL filter in Vulkan
Finally I made a Vulkan application that uses the PNG loader. The full source code can be found in samples/warpgpu/2_images.lc.
After the image has been loaded, we upload it to the GPU by memcpying it into a mapped location like this:
image.mem = gpu.mem.associate(image, [
    "HOST_VISIBLE_BIT", "HOST_COHERENT_BIT"
])
data = image.mem.map(ffi.byte, image.offset, image.size)
ffi.memcpy(data, png_image.data, png_image.data.length)
Of course, this is a bit wasteful, because the PNG loader wouldn't really need anything other than the two latest scanlines. It could write into the scanline and the image simultaneously to load the image without ever holding the fully decompressed image in memory.
It was nice to start with the simple use case first though.
Finally I checked the image in the GLSL output. It produces the effect by shifting the texture coordinates horizontally using output from a Voronoi noise generator:
Extra: Let's try SSE & SIMD instructions!
At first I thought I wouldn't go this far, but I still had Sunday to go and everyone reading this post would expect me to get into SIMD. So let's go!
Feature detection is mandatory if you intend to use SIMD, but the first thing was to check what my CPU actually supports. So I ran cat /proc/cpuinfo:
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz
stepping : 7
microcode : 0x70a
cpu MHz : 2003.000
cache size : 6144 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm tpr_shadow vnmi flexpriority dtherm
bugs :
bogomips : 5666.40
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
It seems that I can use all the instructions up to SSE4_1. Unfortunately the use of SSE doesn't require the VEX prefix, so I don't have a way to test whether I would get the VEX encoding right.
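A quick and dirty way to do the same check programmatically on Linux is to parse the flags line of /proc/cpuinfo; a real runtime would use the CPUID instruction instead, and this little helper is hypothetical, not part of the Lever runtime.

def cpu_flags():
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
print("sse4_1" in flags)   # True on this Core 2 Quad
print("avx" in flags)      # False, so no VEX-encoded instructions to test with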
It's no wonder that stereotypical geeks are charismatic white males with fancy beards and no girlfriends! You would need very big wads of cash to stay updated on all the coolest hardware that pushes out of the factories at a constant pace.
A quick look into the instruction table tells me that I have 565 SIMD operations available. When I discard the operations that work on floating point values, I end up with 361 operations to choose from.
Vectorizing the program
The first step is to look at the original program and see what we could do for it. To make this easier, we break it into extended basic blocks:
t = a + b
p = t - c
ta = p - a
pa = abs ta
tb = p - b
pb = abs tb
tc = p - c
pc = abs tc
c1 = le pa pb
c2 = le pa pc
c3 = and c1 c2
return a if c3
c4 = le pb pc
return b if c4
return c
It is a surprisingly large number of operations after all. What we are about to do is called auto-vectorization in compilers. I guess what we are specifically about to do is called loop vectorization.
One important thing to note is that the algorithm has not been defined on bytes. This means that we might get the wrong result if we keep the bytes packed in our registers. We may have to unpack, so we are going to need instructions for that. We are also going to need instructions to load values into the XMM registers. I think the following is sufficient:
PACKSSDW PACKSSDW PACKSSDW PACKSSDW PACKSSWB PACKSSWB PACKSSWB PACKSSWB PACKUSWB PACKUSWB PACKUSWB PACKUSWB
PUNPCKHBW PUNPCKHBW PUNPCKHBW PUNPCKHBW PUNPCKHDQ PUNPCKHDQ PUNPCKHDQ
PUNPCKHDQ PUNPCKHQDQ PUNPCKHQDQ PUNPCKHWD PUNPCKHWD PUNPCKHWD PUNPCKHWD
PUNPCKLBW PUNPCKLBW PUNPCKLBW PUNPCKLBW PUNPCKLDQ PUNPCKLDQ PUNPCKLDQ
PUNPCKLDQ PUNPCKLQDQ PUNPCKLQDQ PUNPCKLWD PUNPCKLWD PUNPCKLWD PUNPCKLWD
MOVDQA MOVDQU
Second, I looked up what I could find in our instruction table of valid operations. I found the following:
t = a + b PADDB, PADDD, PADDQ, PADDSB, PADDSW, PADDUSB, PADDUSW, PADDW, PADDW, PALIGNR [m64, m128]
p = t - c PSUBB, PSUBD, PSUBQ, PSUBSB, PSUBSW, PSUBUSB, PSUBUSW, PSUBW
ta = p - a
pa = abs ta PABSB, PABSD PABSW [m64, m128]
tb = p - b
pb = abs tb
tc = p - c
pc = abs tc
c1 = le pa pb PCMPGTB, PCMPGTW, PCMPGTD [m64, m128]
c2 = le pa pc
c3 = and c1 c2 PAND, PANDN [m64, m128]
return a if c3
c4 = le pb pc
return b if c4
return c
It appears that we have m128 variants for everything, so we can use the SSE2 instructions everywhere here. We would also have PHADDW to add packed short integers horizontally.
The SIMD instructions do not have conditional operations, so we have to use a trick instead (the mask trick is sketched right after this listing).
return a if c3 PAND, PANDN, POR, PXOR [m64, m128]
c4 = le pb pc
return b if c4
return c
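The trick is the classic branchless select: build an all-ones or all-zeros mask with the compare, then combine (a AND mask) OR (b AND NOT mask). A small sketch on plain integers shows the idea that the PAND/PANDN/POR sequence implements lane-wise on the XMM registers.

MASK32 = 0xFFFFFFFF

def select_by_mask(mask, a, b):
    # mask is all ones where the condition holds, all zeros elsewhere
    return (a & mask) | (b & (~mask & MASK32))

assert select_by_mask(MASK32, 0x11111111, 0x22222222) == 0x11111111
assert select_by_mask(0x00000000, 0x11111111, 0x22222222) == 0x22222222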
Now we have everything we need to carry this out. The final program:
result = 1168 MOVDQU [input + i - offset]
a = 1164 MOVDQU [scanline + i - stride]
b = 1168 MOVDQU [prior + i]
c = 1164 MOVDQU [prior + i - stride]
al = 1786 PUNPCKLBW a
bl = 1786 PUNPCKLBW b
cl = 1786 PUNPCKLBW c
ah = 1772 PUNPCKHBW a
bh = 1772 PUNPCKHBW b
ch = 1772 PUNPCKHBW c
pal = 1165 MOVDQA al
pah = 1165 MOVDQA ah
pal += 1395 PADDD bl
pah += 1395 PADDD bh
pal -= 1742 PSUBD al
pah -= 1742 PSUBD ah
pbl = 1165 MOVDQA pal
pbh = 1165 MOVDQA pah
pcl = 1165 MOVDQA pal
pch = 1165 MOVDQA pah
pal -= 1742 PSUBD al
pbl -= 1742 PSUBD bl
pcl -= 1742 PSUBD cl
pah -= 1742 PSUBD ah
pbh -= 1742 PSUBD bh
pch -= 1742 PSUBD ch
pal = 1369 PABSD pal
pbl = 1369 PABSD pbl
pcl = 1369 PABSD pcl
pah = 1369 PABSD pah
pbh = 1369 PABSD pbh
pch = 1369 PABSD pch
c1l = 1472 PCMPGTD pbl pal
c2l = 1472 PCMPGTD pcl pal
c4l = 1472 PCMPGTD pcl pbl
c1h = 1472 PCMPGTD pbh pah
c2h = 1472 PCMPGTD pch pah
c4h = 1472 PCMPGTD pch pbh
c3l = 1427 PAND c1l c2l
c3h = 1427 PAND c1h c2h
bl = 1427 PAND bl c4l
bh = 1427 PAND bh c4h
cl = 1431 PANDN cl, c4l
ch = 1431 PANDN ch, c4h
bl = 1650 POR bl, cl
bh = 1650 POR bh, ch
al = 1427 PAND al, c3l
ah = 1427 PAND ah, c3h
bl = 1431 PANDN bl, c3l
bh = 1431 PANDN bh, c3h
al = 1650 POR al, bl
ah = 1650 POR ah, bh
delta = 1387 PACKUSWB al, ah
result += 1391 PADDB delta
1167 MOVDQU [scanline], result
It's ridiculous! 54 operations that run on 16 bytes of data at once. Next we should allocate registers and produce the program. On Linux we are free to use the whole range of XMM registers without having to save them, so we do so.
simd_paeth_assembly = (emit, input_addr, a_addr, b_addr, c_addr, scanline_addr):
xmm0 = Register(m128, 0)
xmm1 = Register(m128, 1)
xmm2 = Register(m128, 2)
xmm3 = Register(m128, 3)
xmm4 = Register(m128, 4)
xmm5 = Register(m128, 5)
xmm6 = Register(m128, 6)
xmm7 = Register(m128, 7)
xmm8 = Register(m128, 8)
xmm9 = Register(m128, 9)
xmm10 = Register(m128, 10)
xmm11 = Register(m128, 11)
xmm12 = Register(m128, 12)
xmm13 = Register(m128, 13)
xmm14 = Register(m128, 14)
xmm15 = Register(m128, 15)
# register allocation
result = xmm0
al = xmm1
bl = xmm2
cl = xmm3
ah = xmm4
bh = xmm5
ch = xmm6
pal = xmm7
pbl = xmm8
pcl = xmm9
pah = xmm10
pbh = xmm11
pch = xmm12
# result = 1164 MOVDQU [input + i - offset]
emit(1168, result, input_addr)
# a = 1168 MOVDQU [scanline + i - stride]
emit(1164, al, a_addr)
# b = 1168 MOVDQU [prior + i]
emit(1168, bl, b_addr)
# c = 1168 MOVDQU [prior + i - stride]
emit(1164, cl, c_addr)
# ah = 1772 PUNPCKHBW a
# bh = 1772 PUNPCKHBW b
# ch = 1772 PUNPCKHBW c
emit(1772, ah, al)
emit(1772, bh, bl)
emit(1772, ch, cl)
# al = 1786 PUNPCKLBW a
# bl = 1786 PUNPCKLBW b
# cl = 1786 PUNPCKLBW c
emit(1786, al, al)
emit(1786, bl, bl)
emit(1786, cl, cl)
# pal = 1165 MOVDQA al
# pah = 1165 MOVDQA ah
emit(1165, pal, al)
emit(1165, pah, ah)
# pal += 1395 PADDD bl
# pah += 1395 PADDD bh
emit(1395, pal, bl)
emit(1395, pah, bh)
# pal -= 1742 PSUBD al
# pah -= 1742 PSUBD ah
emit(1742, pal, al)
emit(1742, pah, ah)
# pbl = 1165 MOVDQA pal
# pbh = 1165 MOVDQA pah
# pcl = 1165 MOVDQA pal
# pch = 1165 MOVDQA pah
emit(1165, pbl, pal)
emit(1165, pbh, pah)
emit(1165, pcl, pcl)
emit(1165, pch, pch)
# pal -= 1742 PSUBD al
# pbl -= 1742 PSUBD bl
# pcl -= 1742 PSUBD cl
# pah -= 1742 PSUBD ah
# pbh -= 1742 PSUBD bh
# pch -= 1742 PSUBD ch
emit(1742, pal, al)
emit(1742, pbl, bl)
emit(1742, pcl, cl)
emit(1742, pah, ah)
emit(1742, pbh, bh)
emit(1742, pch, ch)
# pal = 1369 PABSD pal
# pbl = 1369 PABSD pbl
# pcl = 1369 PABSD pcl
# pah = 1369 PABSD pah
# pbh = 1369 PABSD pbh
# pch = 1369 PABSD pch
emit(1369, pal, pal)
emit(1369, pbl, pbl)
emit(1369, pcl, pcl)
emit(1369, pah, pah)
emit(1369, pbh, pbh)
emit(1369, pch, pch)
# We would need at least 4 registers more, so
# we need to figure out which ones we can reuse.
c1l = xmm13
c1h = xmm14
c2l = xmm15
# If we do some reordering here, we can
# reuse pal
# c1l = 1472 PCMPGTD pbl pal
# c1h = 1472 PCMPGTD pbh pah
# c2l = 1472 PCMPGTD pcl pal
# c2h = 1472 PCMPGTD pch pah
emit(1164, c1l, pbl)
emit(1472, c1l, pal)
emit(1164, c2l, pcl)
emit(1472, c2l, pal)
c2h = pal
emit(1164, c2h, pch)
emit(1472, c2h, pah)
emit(1164, c1h, pbh)
emit(1472, c1h, pah)
# c4l = 1472 PCMPGTD pcl pbl
# c4h = 1472 PCMPGTD pch pbh
# We can reuse pcl/pch for c4l, c4h
c4l = pcl # emit(1164, c4l, pcl)
c4h = pch # emit(1164, c4h, pch)
emit(1472, c4l, pbl)
emit(1472, c4h, pbh)
# c3l = 1427 PAND c1l c2l
# c3h = 1427 PAND c1h c2h
# We can reuse c1* for c3*
c3l = c1l
c3h = c1h
emit(1427, c3l, c2l)
emit(1427, c3h, c2h)
# bl = 1427 PAND bl c4l
# bh = 1427 PAND bh c4h
emit(1427, bl, c4l)
emit(1427, bh, c4h)
# cl = 1431 PANDN cl, c4l
# ch = 1431 PANDN ch, c4h
emit(1431, cl, c4l)
emit(1431, ch, c4h)
# bl = 1650 POR bl, cl
# bh = 1650 POR bh, ch
emit(1650, bl, cl)
emit(1650, bh, ch)
# al = 1427 PAND al, c3l
# ah = 1427 PAND ah, c3h
emit(1427, al, c3l)
emit(1427, ah, c3h)
# bl = 1431 PANDN bl, c3l
# bh = 1431 PANDN bh, c3h
emit(1431, bl, c3l)
emit(1431, bh, c3h)
# al = 1650 POR al, bl
# ah = 1650 POR ah, bh
emit(1650, al, bl)
emit(1650, ah, bh)
# delta = 1387 PACKUSWB al, ah
delta = al
emit(1387, delta, ah)
# result += 1391 PADDB delta
emit(1391, result, delta)
# 1168 MOVDQU [scanline + i], result
emit(1167, scanline_addr, result)
return result
The dump of the above program is 288 bytes long. A quick look with ndisasm revealed that the assembler works. Now we have to install the above thing into a loop to have it run. We can reuse most of our old loop:
assemble = ():
# 1811 PUSH r64 [64] Push r64.
# we no longer clobber any registers we should save.
# i = 0 .r0
# 5973 XOR m64 r64 [64] r/m64 XOR r64.
emit(5973, Register(i64, 0), Register(i64, 0))
# j = offset .arg_4(8)
# 1289 NEG r64 [64] Two's complement negate r/m64.
# 74 ADD r64 r64 [64] Add r64 to r/m64.
# k = j - stride .r1
emit(1289, Register(i64, 1))
emit( 74, Register(i64, 1), Register(i64, 8))
# loop:
loop_label := output.length
# 0475 CMP m64 r64 [64] Compare r64 with r/m64.
# if i >= length goto exit
emit( 475, Register(i64, 0), arg_5)
# 0862 JAE rel32 [64] [L] Jump short if above or equal (CF=0).
emit( 862, Immediate(i32, exit_label - ja_label))
ja_label := output.length
simd_paeth_assembly(emit,
Address(m128, 0, 2, 0), # input = [input .r2 + i .r0]
Address(m128, 0, 6, 1), # a = [scanline .r6 + k .r1]
Address(m128, 0, 7, 8), # b = [prior .r7 + j .r8]
Address(m128, 0, 7, 1), # c = [prior .r7 + k .r1]
Address(m128, 0, 6, 8)) # output = [scanline .r6 + j .r8]
# i, j, k += 16
emit( 72, Register(i64, 0), Immediate(i8, 16))
emit( 72, Register(i64, 8), Immediate(i8, 16))
emit( 72, Register(i64, 1), Immediate(i8, 16))
# jump loop
# 0886 JMP rel32 [64] [L] Jump short, RIP = RIP + 8-bit displacement sign extended to 64-bits
emit( 886, Immediate(i32, loop_label - exit_label))
# exit:
exit_label := output.length
# return
emit(1901) # RET NEAR
Now, unlike the original assembly program, we can't use this new implementation in isolation, because it churns through 16-byte chunks in one swoop. Therefore we need the JIT version to handle the beginning and the end, so these two have to be combined:
make_new_filter = (optimized, ordinary):
    return (prior, scanline, input, stride, offset, length):
        # max(0, offset - stride) + 16 must be filled.
        fast_offset = max(max(0, offset-stride) + 16, offset)
        fast_index = fast_offset - offset
        remain_length = length - fast_index
        last_length = remain_length & 16
        fast_length = remain_length - last_length
        last_index = length - last_length
        last_offset = offset + last_index
        ordinary(prior, scanline, input,
            stride, offset, fast_index)
        optimized(prior, scanline, input[fast_index .:],
            stride, fast_offset, fast_length)
        ordinary(prior, scanline, input[last_index .:],
            stride, last_offset, last_length)
That was already a bit tricky to write.
Debugging
When we run the code, we notice it produces wrong output. Here are the first two scanlines of the JIT-only version:
7f 83 82 ff 7c 80 7f ff 7f 83 82 ff 7b 7f 7e ff ... 48 52 2f ff 4f 59 36 ff 50 5a 37 ff 4e 58 35 ff
7e 82 81 ff 7b 7f 7e ff 7e 82 81 ff 7a 7e 7d ff ... 50 5a 37 ff 57 61 3e ff 57 61 3e ff 52 5c 39 ff
And here's our JIT+SIMD version:
7f 83 82 ff 7c 80 7f ff 7f 83 82 ff 7b 7f 7e ff ... 48 52 2f ff 4f 59 36 ff 50 5a 37 ff 4e 58 35 ff
7e 82 81 ff 7b 7f 7e ff 7e 82 81 ff 7a 7e 7d ff ... 07 07 07 00 0e 0e 0e 00 0e 0e 0e 00 09 09 09 00
It looks a lot like we have something wrong in our paeth. The first line looks okay, but actually a different filter runs there; our filter runs on the second row. Fortunately we can place a software breakpoint:
# 852 INT 3 [64] [L] Interrupt 3-trap to debugger.
emit( 852, null)
But maybe I am being hasty here, because my display shows only 16 bytes from the beginning and the end! This means that on the left side our optimized program hasn't been running at all.
So our filter is supposed to fill the first 16 bytes so that the actual program can start there, like this:
pre-fill: AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA
scanline+k: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
scanline+j: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
But then we actually start the optimized routine like this:
pre-fill: AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA
scanline+k: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
scanline+j: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Oops. Here's the fixed version.
make_new_filter = (optimized, ordinary):
    return (prior, scanline, input, stride, offset, length):
        if length < 32 # if it's such a small chunk that our thing breaks
            return ordinary(prior, scanline, input, stride, offset, length)
        # max(0, offset - stride) + 16 must be filled.
        fill_offset = max(max(0, offset-stride) + 16, offset)
        fast_offset = max(offset, stride)
        fast_index = fast_offset - offset
        remain_length = length - fast_index
        last_length = remain_length & 15
        fast_length = remain_length - last_length
        last_index = length - last_length
        last_offset = offset + last_index
        ordinary(prior, scanline, input,
            stride, offset, fill_offset - offset)
        optimized(prior, scanline, input[fast_index .:],
            stride, fast_offset, fast_length)
        ordinary(prior, scanline, input[last_index .:],
            stride, last_offset, last_length)
It still gives wrong results, but they are more apparent now. Next we use that trap instruction I showed, and run the following in GDB:
display /x $xmm0.uint128
display /x $xmm1.uint128
display /x $xmm2.uint128
display /x $xmm3.uint128
display /x $xmm4.uint128
display /x $xmm5.uint128
display /x $xmm6.uint128
display /x $xmm7.uint128
display /x $xmm8.uint128
display /x $xmm8.uint128
display /x $xmm9.uint128
display /x $xmm10.uint128
display /x $xmm11.uint128
display /x $xmm12.uint128
display /x $xmm13.uint128
display /x $xmm14.uint128
display /x $xmm15.uint128
layout asm
After the MOVDQU instructions, xmm0-xmm3 look like this:
1: /x $xmm0.uint128 = 0x00fefefe00ffffff00ffffff00ffffff
2: /x $xmm1.uint128 = 0xff7d7e7aff81827eff7e7f7bff81827e
3: /x $xmm2.uint128 = 0xff7d7e7aff7e7f7bff82837fff7f807c
4: /x $xmm3.uint128 = 0xff7e7f7bff82837fff7f807cff82837f
The xmm0 looks very familiar. I think it'll be apparent very soon why this code produces the wrong results.
After the punpck{lbw,hbw} have run, the register dump looks like the following:
1: /x $xmm0.uint128 = 0x00fefefe00ffffff00ffffff00ffffff
2: /x $xmm1.uint128 = 0xffff7e7e7f7f7b7bffff818182827e7e
3: /x $xmm2.uint128 = 0xffff828283837f7fffff7f7f80807c7c
4: /x $xmm3.uint128 = 0xffff7f7f80807c7cffff828283837f7f
5: /x $xmm4.uint128 = 0xff007d007e007a00ff00810082007e00
6: /x $xmm5.uint128 = 0xff007d007e007a00ff007e007f007b00
7: /x $xmm6.uint128 = 0xff007e007f007b00ff00820083007f00
It is suddenly painfully apparent that the destination and source register in the unpack operation cannot be the same. So we have to fix those lines like this:
# ah = 1817 PXOR ah
# bh = 1817 PXOR bh
# ch = 1817 PXOR ch
# al = 1817 PXOR al
# bl = 1817 PXOR bl
# cl = 1817 PXOR cl
emit(1817, al, al)
emit(1817, bl, bl)
emit(1817, cl, cl)
emit(1817, ah, ah)
emit(1817, bh, bh)
emit(1817, ch, ch)
# ah = 1772 PUNPCKHBW a
# bh = 1772 PUNPCKHBW b
# ch = 1772 PUNPCKHBW c
emit(1772, ah, pal)
emit(1772, bh, pbl)
emit(1772, ch, pcl)
# al = 1786 PUNPCKLBW a
# bl = 1786 PUNPCKLBW b
# cl = 1786 PUNPCKLBW c
emit(1786, al, pal)
emit(1786, bl, pbl)
emit(1786, cl, pcl)
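To make sure I understood the unpack, here is the mistake and the fix simulated in plain Python. This assumes the usual SSE2 behaviour for PUNPCKLBW, where the result interleaves the low eight bytes of the destination with the low eight bytes of the source; the helper name is mine:
def punpcklbw(dst, src):
    out = []
    for d, s in zip(dst[:8], src[:8]):
        out += [d, s]                     # destination byte first, then source byte
    return out

data = [0x7e, 0x82, 0x81, 0xff, 0x7b, 0x7f, 0x7e, 0xff]
zero = [0] * 8

# Destination and source the same: every byte is duplicated,
# which is exactly what the earlier register dump showed.
print([hex(b) for b in punpcklbw(data, data)])
# Destination zeroed with PXOR first: each byte gets its own 16-bit lane,
# but it lands in the high half of the word because of the interleave order.
print([hex(b) for b in punpcklbw(zero, data)])
Note where the byte lands within its word; that detail comes up again further down.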
Another iteration of the first entry with a debugger reveals that we calculate a + b - a rather than a + b - c like we should. This fixes that:
# pal -= 1742 PSUBD cl
# pah -= 1742 PSUBD ch
emit(1742, pal, cl)
emit(1742, pah, ch)
It seems that at the end we end up adding zero, so this program is actually doing nothing. The only reasonable explanation is that we mess up the conditionals somehow.
After the end of the conditional block, the result looks like this:
1: /x $xmm0.uint128 = 0x00fefefe00ffffff00ffffff00ffffff (result)
2: /x $xmm1.uint128 = 0xff007e007f007b00ff00810082007e00 (al)
3: /x $xmm2.uint128 = 0xff00820083007f00ff007f0080007c00 (bl)
4: /x $xmm3.uint128 = 0xff007f0080007c00ff00820083007f00 (cl)
5: /x $xmm4.uint128 = 0xff007d007e007a00ff00810082007e00 (ah)
6: /x $xmm5.uint128 = 0xff007d007e007a00ff007e007f007b00 (bh)
7: /x $xmm6.uint128 = 0xff007e007f007b00ff00820083007f00 (ch)
8: /x $xmm7.uint128 = 0xffffffffffffffffffffffffffffffff (c2h)
10: /x $xmm8.uint128 = 0x00000100010001000000010001000100 (pbl)
11: /x $xmm9.uint128 = 0xffffffffffffffffffffffffffffffff (c4l)
12: /x $xmm10.uint128 = 0x00000100010001000000040004000400 (pah)
13: /x $xmm11.uint128 = 0x00000100010001000000010001000100 (pbh)
14: /x $xmm12.uint128 = 0xffffffffffffffffffffffffffffffff (c4h)
15: /x $xmm13.uint128 = 0x00000000000000000000000000000000 (c3l)
16: /x $xmm14.uint128 = 0x00000000000000000000000000000000 (c3h)
17: /x $xmm15.uint128 = 0xffffffffffffffffffffffffffffffff (c2l)
The labels reveal that we have the following state:
al, bl, cl, ah, bh, ch
c3 = false
c4 = true
This should mean that the result b is selected. The following instructions should correctly preserve the result.
# bl = 1427 PAND bl c4l
# bh = 1427 PAND bh c4h
emit(1427, bl, c4l)
emit(1427, bh, c4h)
They did. Correspondingly, these next ones should zero out cl and ch.
# cl = 1431 PANDN cl, c4l
# ch = 1431 PANDN ch, c4h
emit(1431, cl, c4l)
emit(1431, ch, c4h)
I think I messed up something again. Reading up on this reveals that PANDN inverts the destination and then does the AND. It's not exactly what we expected. We have to flip the operands over and merge with c3 and c4.
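Since the naming keeps tripping me up, here is the behaviour simulated in plain Python, assuming the SSE2 semantics where PANDN inverts its destination operand before the AND; the constants and names are mine:
MASK128 = (1 << 128) - 1

def pand(dst, src):  return dst & src
def pandn(dst, src): return (~dst & MASK128) & src   # inverts the destination

b, c = 0x2222, 0x3333
for cond in (0, MASK128):                # all-zeros selects c, all-ones selects b
    want = b if cond else c
    # The select we want is (b & cond) | (c & ~cond), so the condition mask
    # has to sit in PANDN's destination, not in its source.
    got = pand(b, cond) | pandn(cond, c)
    assert got == want
In the actual machine code this also means PANDN overwrites the register holding the mask, so a copy is needed if the mask gets used again.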
The result is still completely wrong:
7f 83 82 ff 7c 80 7f ff 7f 83 82 ff 7b 7f 7e ff ... 48 52 2f ff 4f 59 36 ff 50 5a 37 ff 4e 58 35 ff
7e 82 81 ff fe ff fe 00 fe ff ff 00 fe fe fe 00 ... 06 06 06 00 0d 0d 0d 00 0d 0d 0d 00 08 08 08 00
One of the culprits could be the PCMPGTD. We use it like this:
c1 = PCMPGTD pb pa
If this actually means 'greater than', then the code is:
c1 = pb > pa
And the condition was:
if pa <= pb and pa <= pc
return a
elif pb <= pc
return b
else
return c
So there we have our problem. I had to think about this a bit and transform it into the correct form.
if not (pa > pb or pa > pc)
return a
elif not pb > pc
return b
else
return c
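To double-check that the transformed condition is equivalent, and to see how the selection reads when only "greater than" comparisons are available, here is a brute-force comparison in plain Python; the function names are mine and this is not the exact register dance of the machine code:
def paeth(a, b, c):
    # the straightforward predictor from earlier in the post
    p = a + b - c
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    if pa <= pb and pa <= pc:
        return a
    elif pb <= pc:
        return b
    return c

def paeth_gt_only(a, b, c):
    p = a + b - c
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    c1 = pa > pb                          # the only comparisons PCMPGT gives us
    c2 = pa > pc
    c3 = pb > pc
    take_a = not (c1 or c2)
    take_b = not take_a and not c3
    take_c = not take_a and c3
    return (a if take_a else 0) | (b if take_b else 0) | (c if take_c else 0)

for a in range(0, 256, 17):
    for b in range(0, 256, 17):
        for c in range(0, 256, 17):
            assert paeth(a, b, c) == paeth_gt_only(a, b, c)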
Even then the code is not correct yet. Let's look at the input next. It looks weird because the register dump prints it back to front. Let's label the registers and compare them with what we know.
input: /x $xmm0.uint128 = 0x 00 fe fe fe 00 ff ff ff 00 ff ff ff 00 ff ff ff
a: /x $xmm7.uint128 = 0x ff 7d 7e 7a ff 81 82 7e ff 7e 7f 7b ff 81 82 7e
b: /x $xmm8.uint128 = 0x ff 7d 7e 7a ff 7e 7f 7b ff 82 83 7f ff 7f 80 7c
c: /x $xmm9.uint128 = 0x ff 7e 7f 7b ff 82 83 7f ff 7f 80 7c ff 82 83 7f
prior: 7f 83 82 ff 7c 80 7f ff 7f 83 82 ff 7b 7f 7e ff ... 48 52 2f ff 4f 59 36 ff 50 5a 37 ff 4e 58 35 ff
scanline: 7e 82 81 ff 7b 7f 7e ff 7e 82 81 ff 7a 7e 7d ff ... 50 5a 37 ff 57 61 3e ff 57 61 3e ff 52 5c 39 ff
This is precise. Next the data gets unpacked into the six registers. We see them being zeroed first:
al: /x $xmm1.uint128 = 0x00000000000000000000000000000000
bl: /x $xmm2.uint128 = 0x00000000000000000000000000000000
cl: /x $xmm3.uint128 = 0x00000000000000000000000000000000
ah: /x $xmm4.uint128 = 0x00000000000000000000000000000000
bh: /x $xmm5.uint128 = 0x00000000000000000000000000000000
ch: /x $xmm6.uint128 = 0x00000000000000000000000000000000
Then they get filled up with data:
al: /x $xmm1.uint128 = 0xff00 7e00 7f00 7b00 ff00 8100 8200 7e00
bl: /x $xmm2.uint128 = 0xff00 8200 8300 7f00 ff00 7f00 8000 7c00
cl: /x $xmm3.uint128 = 0xff00 7f00 8000 7c00 ff00 8200 8300 7f00
ah: /x $xmm4.uint128 = 0xff00 7d00 7e00 7a00 ff00 8100 8200 7e00
bh: /x $xmm5.uint128 = 0xff00 7d00 7e00 7a00 ff00 7e00 7f00 7b00
ch: /x $xmm6.uint128 = 0xff00 7e00 7f00 7b00 ff00 8200 8300 7f00
So the problem appears to be that we do the wrong kind of unpacking operation here. We can fix it with PSRLW, which shifts each 16-bit word right; a shift by 8 moves the byte down into the low half of its word.
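A quick sanity check of that fix in plain Python: shifting each 16-bit lane right by 8, which is what PSRLW with an immediate of 8 does, drops the byte from the high half of the word into the low half:
words = [0xff00, 0x7e00, 0x7f00, 0x7b00]     # lanes as the dump above shows them
print([hex(w >> 8) for w in words])          # ['0xff', '0x7e', '0x7f', '0x7b']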
The beginning of the sequence started to look right at this point, but the rest was still wrong. I started to realise there was something more fundamentally wrong here.
Ok. So we fill our first run like this:
7e 82 81 ff .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
Then we jump ahead by 16 bytes and we start reading the input from here:
7e 82 81 ff .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^
You know what? At this point I realised I cannot do it this way. Each reconstructed byte depends on the reconstructed byte one stride to its left, so with this parallelization technique I can only compute the contents of one pixel at a time. And it's messy considering what it does.
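To spell out the dependency that makes this a dead end, here is a sketch of the unfiltering in plain Python, mirroring how the scalar filter walks a scanline; the names are mine and it only illustrates the data flow:
def paeth(a, b, c):
    p = a + b - c
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    return a if pa <= pb and pa <= pc else (b if pb <= pc else c)

def unfilter_scanline(prior, data, stride):
    out = []
    for x, byte in enumerate(data):
        l = out[x - stride] if x >= stride else 0      # reconstructed in this very loop
        u = prior[x]
        c = prior[x - stride] if x >= stride else 0
        out.append((byte + paeth(l, u, c)) & 0xff)
    return out
Because l reads bytes that this same loop produced only stride positions earlier, the chain is serial along the scanline, and the stride bytes of a single pixel are the widest unit that can be computed independently.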
I found yet more things I had done wrong, such as using doubleword operations where I should have used word operations. After fixing those I started getting results that seemed plausible, but compared to the correct output they were still incorrect.
Let's leave it on the backburner
This was the first time I have studied the SIMD instructions in detail. The outcome was a complete mess, but I learned so much in the process that I'm not even frustrated.
The techniques presented here clearly work, but I would need to write a compiler to make the machine code approach a more practical choice.