Optimizing Lever PNG loading
The following code took over 9.4 seconds to run:
import png
for i in range(10)
    image = png.read_file(dir ++ "frants_boe_villender_test.png")
The image to load was a 640x480 RGBA image from a painting of Frants Diderik Bøe, who specialized in still life and landscapes.
The program became slow when I added the scanline filtering, so I would expect to know what slows it down. But I am really not in the mood for guessing wrong, so let's run the code through a profiler like this:
import png, fs, vmprof
vmprof_fd = fs.open(dir ++ "png.vmprof",
    fs.WRONLY | fs.TRUNC | fs.CREAT)
vmprof.enable(vmprof_fd, 0.0001)
for i in range(10)
    image = png.read_file(dir ++ "frants_boe_villender_test.png")
vmprof.disable()
vmprof_fd.close()
The profiler output:
100.0% L3_16 100.0% png_load.lc:3 (line 5 at this post)
100.0% .. L5_39 100.0% png.lc:5
99.6% .... L53_82 99.6% png.lc:53
0.1% ...... L87_90 0.1% png.lc:87
0.1% ...... L90_96 0.1% png.lc:90
49.4% ...... L99_101 49.6% png.lc:99
35.4% ........ L104_117 71.6% png.lc:104
16.9% .......... L117_131 47.9% png.lc:117
0.6% ...... L96_99 0.6% png.lc:96
0.3% .... L142_159 0.3% png.lc:142
0.1% ...... L159_161 37.5% png.lc:159
At png.lc:53 we have the code that runs the decoding loop:
read = (self, data):
    data = self.z.decompress(data)
    for byte in data
        if self.new_scanline
            self.filter = decode_filters[byte]
            self.new_scanline = false
            continue
        if self.l == 0
            u = 0
            c = 0
            if self.x - self.stride >= self.l
                l = self.data[self.x - self.stride]
            else
                l = 0
        else
            u = self.data[self.x - self.y_stride]
            if self.x - self.stride >= self.l
                l = self.data[self.x - self.stride]
                c = self.data[self.x - self.y_stride - self.stride]
            else
                l = 0
                c = 0
        self.data[self.x] = self.filter(byte, l, u, c)
        self.x += 1
        if self.x >= self.r
            self.l += self.y_stride
            self.r += self.y_stride
            self.new_scanline = true
It tells us that almost half of the time is spent at png.lc:99.
decode_filters = {
    4: (byte, l, u, c):
        return byte + paeth_predictor(l, u, c)
}
paeth_predictor = (a, b, c):
    p = a + b - c
    pa = abs_int(p - a)
    pb = abs_int(p - b)
    pc = abs_int(p - c)
    if pa <= pb and pa <= pc
        return a
    elif pb <= pc
        return b
    else
        return c

# TODO: make them into multimethods
abs_int = (a):
    if a < 0
        return -a
    return a
Line 60 is the paeth_predictor, and line 73 is the abs_int.
Motive
There are many C libraries for loading PNG images that I could have integrated into my runtime.
Lever is a gradually designed programming language. Observations made about the behavior of the language are used to improve it further. The intention is to leave in the possibility of changing the language, so that it can be refined over time to accommodate new discoveries.
One of my objectives with Lever is to make it into a high-level programming language that obtains good runtime performance on its own, and then supply it with utilities to optimize those implementations to match the performance obtainable with the C language.
To get anywhere with this, we need to start with something simple, and preferably something practical. Therefore I decided to write my own PNG implementation in Lever.
First the abs multimethod
The abs_int is a crutch that I made because Lever's abs function only operates on floats. The limitation isn't very elegant though. I've intended to remove it when it becomes more relevant.
When you're optimizing code, it is preferable to start with a version that doesn't have any crutches originating from unimplemented features in the programming language's runtime, so the first obvious thing here is to remove the abs_int and replace it with a properly implemented abs multimethod.
The implementation of abs is in runtime/vectormath.py, as pointed out by the documentation for abs(). We replace it with the multimethod implementation:
abs_ = Multimethod(1)

@abs_.multimethod_s(Float)
def abs_float(f):
    return Float(-f.number) if f.number < 0.0 else f

@abs_.multimethod_s(Integer)
def abs_int(i):
    return Integer(-i.value) if i.value < 0 else i
The Lever runtime needs to be retranslated for the changes to take effect. We can then remove the abs_int and replace it with abs in the paeth_predictor.
Meanwhile we have time to think about the paeth predictor. When you look at the variable flow, every math operation appears to require the result of the previous one. This is an important observation because we are running on a superscalar processor.
We are still somewhere on the dark side of the moon when it comes to instruction-level optimizations, though. When I run the program again, it now takes 8 seconds to run.
100.0% L3_16 100.0% png_load.lc:3 (line 5)
100.0% .. L5_39 100.0% png.lc:5
99.4% .... L53_82 99.4% png.lc:53 (line 27)
0.9% ...... L96_99 0.9% png.lc:96
0.2% ...... L90_96 0.2% png.lc:90
32.4% ...... L99_101 32.6% png.lc:99 (line 56)
15.3% ........ L104_125 47.1% png.lc:104 (line 60)
0.1% ...... L87_90 0.1% png.lc:87
0.4% .... L136_153 0.4% png.lc:136
0.3% ...... L153_155 62.5% png.lc:153
0.1% .... L125_136 0.1% png.lc:125
0.1% ...... L153_155 100.0% png.lc:153
So in total we shaved 17% with that small change. But it is still the paeth predictor and the PNG decoding loop that are slowing us down.
Checking the JIT traces
The JIT trace is perhaps showing us something interesting. The following command extracts it for us and places it into the pnglog:
PYPYLOG=jit-log-opt:pnglog lever samples/png/png_load.lc
This is the first time I am using the JIT trace to optimize a program, so my function for the printable location only returned the program counter:
debug_merge_point(0, 0, 'pc=25 setloc module=<module png>')
debug_merge_point(0, 0, 'pc=29 getglob module=<module png>')
debug_merge_point(0, 0, 'pc=32 getloc module=<module png>')
debug_merge_point(0, 0, 'pc=35 getloc module=<module png>')
debug_merge_point(0, 0, 'pc=38 getglob module=<module png>')
debug_merge_point(0, 0, 'pc=41 call module=<module png>')
+1119: i119 = int_sub(i118, i110)
I adjusted it to provide line numbers too. I'm sure it will need something better later on:
debug_merge_point(0, 0, 'pc=25 setloc module=<module png>:105')
debug_merge_point(0, 0, 'pc=29 getglob module=<module png>:105')
debug_merge_point(0, 0, 'pc=32 getloc module=<module png>:106')
debug_merge_point(0, 0, 'pc=35 getloc module=<module png>:106')
debug_merge_point(0, 0, 'pc=38 getglob module=<module png>:106')
debug_merge_point(0, 0, 'pc=41 call module=<module png>:106')
+1119: i119 = int_sub(i118, i110)
After dropping out lots of debug_merge_points, I see that the paeth predictor ends up like this:
+916: guard_nonnull_class(p3, ConstClass(Integer), descr=<Guard0x7effcc4a9db0>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+941: guard_nonnull_class(p5, ConstClass(Integer), descr=<Guard0x7effcc4a9e08>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+966: guard_not_invalidated(descr=<Guard0x7effcc475e20>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+980: p108 = getfield_gc_r(ConstPtr(ptr107), descr=<FieldP space.multimethod.Multimethod.inst_version 48>)
+991: guard_value(p108, ConstPtr(ptr109), descr=<Guard0x7effcc475e60>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+1000: i110 = getfield_gc_i(p3, descr=<FieldS space.numbers.Integer.inst_value 8 pure>)
+1011: i111 = getfield_gc_i(p5, descr=<FieldS space.numbers.Integer.inst_value 8 pure>)
+1022: i112 = int_add(i110, i111)
+1032: guard_nonnull_class(p7, ConstClass(Integer), descr=<Guard0x7effcc4a9e60>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+1071: p115 = getfield_gc_r(ConstPtr(ptr114), descr=<FieldP space.multimethod.Multimethod.inst_version 48>)
+1089: guard_value(p115, ConstPtr(ptr116), descr=<Guard0x7effcc475ea0>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+1098: i117 = getfield_gc_i(p7, descr=<FieldS space.numbers.Integer.inst_value 8 pure>)
+1109: i118 = int_sub(i112, i117)
+1119: i119 = int_sub(i118, i110)
+1147: p121 = getfield_gc_r(ConstPtr(ptr120), descr=<FieldP space.multimethod.Multimethod.inst_version 48>)
+1165: guard_value(p121, ConstPtr(ptr122), descr=<Guard0x7effcc475ee0>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+1174: i124 = int_lt(i119, 0)
+1182: guard_false(i124, descr=<Guard0x7effcc475f20>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+1188: i125 = int_sub(i118, i111)
+1198: i127 = int_lt(i125, 0)
+1202: guard_false(i127, descr=<Guard0x7effcc475f60>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+1208: i128 = int_sub(i118, i117)
+1218: i130 = int_lt(i128, 0)
+1222: guard_false(i130, descr=<Guard0x7effcc475fa0>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+1235: p132 = getfield_gc_r(ConstPtr(ptr131), descr=<FieldP space.multimethod.Multimethod.inst_version 48>)
+1246: guard_value(p132, ConstPtr(ptr133), descr=<Guard0x7effcc4c0020>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+1255: i134 = int_le(i119, i125)
+1262: guard_true(i134, descr=<Guard0x7effcc4c0060>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+1268: i135 = int_le(i119, i128)
+1275: guard_true(i135, descr=<Guard0x7effcc4c00a0>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+1281: leave_portal_frame(0)
Superfluous guards
I even found something I should no longer have there:
+980: p108 = getfield_gc_r(ConstPtr(ptr107), descr=<FieldP space.multimethod.Multimethod.inst_version 48>)
+991: guard_value(p108, ConstPtr(ptr109), descr=<Guard0x7effcc475e60>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+1071: p115 = getfield_gc_r(ConstPtr(ptr114), descr=<FieldP space.multimethod.Multimethod.inst_version 48>)
+1089: guard_value(p115, ConstPtr(ptr116), descr=<Guard0x7effcc475ea0>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+1147: p121 = getfield_gc_r(ConstPtr(ptr120), descr=<FieldP space.multimethod.Multimethod.inst_version 48>)
+1165: guard_value(p121, ConstPtr(ptr122), descr=<Guard0x7effcc475ee0>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+1235: p132 = getfield_gc_r(ConstPtr(ptr131), descr=<FieldP space.multimethod.Multimethod.inst_version 48>)
+1246: guard_value(p132, ConstPtr(ptr133), descr=<Guard0x7effcc4c0020>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
Every guard in the JIT converts into a comparison and a conditional branch. The guards ensure that the program diverges from the fast path if it's not applicable. Making the version number pseudoimmutable should help a bit here:
class Multimethod(Object):
    _immutable_fields_ = ['arity', 'multimethod_table', 'interface_table',
                          'default?', 'version?']
Pseudoimmutability is declared when we expect that a value does not change often. The version number changes when the multimethod is updated, but from the perspective of hot loops it never changes.
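To make the effect concrete, here is a rough sketch of the usual version-tag pattern behind such a pseudoimmutable field; the VersionTag class, the methods dict and the register method are illustrative stand-ins, not the actual Lever runtime code. Reads of the version constant-fold inside a trace, and the rare write swaps the tag so old traces get invalidated.

# Sketch of the version-tag pattern (illustrative names, not the Lever runtime).
class Object(object):            # stand-in for the runtime's base object class
    pass

class VersionTag(object):
    pass

class Multimethod(Object):
    _immutable_fields_ = ['arity', 'version?']   # '?' marks the quasi-immutable field

    def __init__(self, arity):
        self.arity = arity
        self.version = VersionTag()
        self.methods = {}

    def register(self, types, fn):
        self.methods[types] = fn
        self.version = VersionTag()  # rare write: invalidates traces that folded the old tag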
We still have lots of guards here.
+820: guard_nonnull_class(p3, ConstClass(Integer), descr=<Guard0x7f9c21d1a1d8>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+845: guard_nonnull_class(p5, ConstClass(Integer), descr=<Guard0x7f9c21d1a230>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+870: guard_not_invalidated(descr=<Guard0x7f9c21ccbf60>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+870: i107 = getfield_gc_i(p3, descr=<FieldS space.numbers.Integer.inst_value 8 pure>)
+888: i108 = getfield_gc_i(p5, descr=<FieldS space.numbers.Integer.inst_value 8 pure>)
+892: i109 = int_add(i107, i108)
+902: guard_nonnull_class(p7, ConstClass(Integer), descr=<Guard0x7f9c21d1a288>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+935: i111 = getfield_gc_i(p7, descr=<FieldS space.numbers.Integer.inst_value 8 pure>)
+939: i112 = int_sub(i109, i111)
+956: i113 = int_sub(i112, i107)
+970: i115 = int_lt(i113, 0)
+974: guard_false(i115, descr=<Guard0x7f9c21ccbfa0>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+980: i116 = int_sub(i112, i108)
+997: i118 = int_lt(i116, 0)
+1001: guard_false(i118, descr=<Guard0x7f9c21d20020>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+1007: i119 = int_sub(i112, i111)
+1017: i121 = int_lt(i119, 0)
+1021: guard_false(i121, descr=<Guard0x7f9c21d20060>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+1027: i122 = int_le(i113, i116)
+1034: guard_true(i122, descr=<Guard0x7f9c21d200a0>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+1040: i123 = int_le(i113, i119)
+1047: guard_true(i123, descr=<Guard0x7f9c21d200e0>) [p0, p3, p5, p7, p9, p11, p13, p15, p18, p20, p22, p24, p26, p28, p30, p32, p34, p36, p38, p40, p42, p44, p46, p48, p50, p52, p54, p56, p58, p60, p62, p64, p66, p68, p70, p72, p74, p76, p78, p80, p82, p84, p86, p88, p90, p92, p94, p96, p98, p100, p102, p104]
+1053: leave_portal_frame(0)
And we are down to 7.1 seconds:
100.0% L3_16 100.0% png_load.lc:3 (line 5)
100.0% .. L5_39 100.0% png.lc:5
99.5% .... L53_82 99.5% png.lc:53 (line 27)
0.7% ...... L96_99 0.7% png.lc:96
0.1% ...... L87_90 0.1% png.lc:87
29.9% ...... L99_101 30.0% png.lc:99 (line 56)
12.3% ........ L104_128 41.2% png.lc:104 (line 60)
0.1% ...... L90_96 0.1% png.lc:90
0.2% .... L139_156 0.2% png.lc:139
0.1% ...... L156_158 25.0% png.lc:156
0.1% .... L40_53 0.1% png.lc:40
Somewhat illogically, the time spent in the paeth predictor increased!
Conditional branching
Another trick to try would be to remove the unnecessary branching by coalescing the conditionals. For that we need a tool; a select() function does fine.
@builtin
@signature(Boolean, Object, Object)
def select(cond, a, b):
    return a if cond == true else b
The optimizer will likely convert this into a conditional move, or so I hope. In any case it should eliminate additional branches that would turn into guards. The paeth_predictor looks like this now:
paeth_predictor = (a, b, c):
    p = a + b - c
    pa = abs(p - a)
    pb = abs(p - b)
    pc = abs(p - c)
    return select(pa <= pb and pa <= pc,
        a, select(pb <= pc, b, c))
The results were perhaps illogical. This was not any faster. When I looked into the trace, the guards were still there. Maybe our implementation has the wrong shape? Let's do it in a way that cannot be misinterpreted:
@builtin
@signature(Boolean, Object, Object)
def select(cond, a, b):
    return [b, a][cond == true]
Even then it does not have the desired effect. Forming a new bytecode instruction that does this shouldn't have any effect either, because the call is constant-folded away anyway.
Reader loop
One problem in our reader loop is that we are doing a whole lot of work inside it. We are checking bounds and fetching values for the filter:
data = self.z.decompress(data)
for byte in data
    if self.new_scanline
        self.filter = decode_filters[byte]
        self.new_scanline = false
        continue
    if self.l == 0
        u = 0
        c = 0
        if self.x - self.stride >= self.l
            l = self.data[self.x - self.stride]
        else
            l = 0
    else
        u = self.data[self.x - self.y_stride]
        if self.x - self.stride >= self.l
            l = self.data[self.x - self.stride]
            c = self.data[self.x - self.y_stride - self.stride]
        else
            l = 0
            c = 0
    # run the filter and update the bounds
    self.data[self.x] = self.filter(byte, l, u, c)
    self.x += 1
    if self.x >= self.r
        self.l += self.y_stride
        self.r += self.y_stride
        self.new_scanline = true
We are doing all of that for every one of the 1 228 800 bytes (640 × 480 × 4), yet we only have 480 scanlines!
We could instead do:
prior = self.prior
scanline = self.scanline
index = self.index
data = self.z.decompress(data)
while data.length > 0
    if index == 0
        self.filter = decode_filters[data[0]]
        data = data[1 .:]
    L = min(self.y_stride - index, data.length)
    for i in range(index, index+L)
        if i < self.stride
            l = 0
            c = 0
        else
            l = scanline[i - self.stride]
            c = prior[i - self.stride]
        u = prior[i]
        scanline[i] = data[i-index] + self.filter(l, u, c)
    if index + L >= self.y_stride
        prior = scanline
        scanline = self.data[self.next_scanline .: self.next_scanline + self.y_stride]
        index = 0
        self.next_scanline += self.y_stride
    else
        index += L
    data = data[L .:]
    assert index < self.y_stride
self.prior = prior
self.scanline = scanline
self.index = index
To always have an empty prior scanline, we initialize it to point at the second scanline. I also flip the addition out of the filter, which helps a little bit.
It now takes 3.6 seconds to load those 10 images.
100.0% L3_17 100.0% png_load.lc:3 (line 5)
100.0% .. L5_39 100.0% png.lc:5
99.2% .... L53_85 99.2% png.lc:53 (line 231)
0.2% ...... L90_91 0.2% png.lc:90
1.5% ...... L96_97 1.5% png.lc:96
27.2% ...... L99_110 27.4% png.lc:99 (line 192)
0.5% ...... L93_94 0.5% png.lc:93
0.3% .... L133_150 0.3% png.lc:133
ImageMagick as a comparison
As a comparison, if I copy the image 10 times and then run:
time convert *.png converted.jpg
It shows me that it takes 0.225s to do the job, and it's not only reading the PNG but also converting it into a JPEG. We are at least 10 to 20 times slower than the C program, which is a bit unacceptable.
Optimization of method lookup
We looked at the trace again on the pypy channel and found yet another lookup_method that was not elided. The likely reason was that the method table was not perceived as immutable. This resulted in the following silly pair of guards:
+600: p172 = getfield_gc_r(p1, descr=<FieldP space.customobject.CustomObject.inst_custom_interface 8 pure>)
+618: p175 = call_r(ConstClass(Interface.lookup_method), p172, ConstPtr(ptr174), descr=<Callr 8 rr EF=4>)
+710: guard_no_exception(descr=<Guard0x7f7b2d3a7d00>) [p0, p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, p12, p13, p14, p15, p16, p17, p18, p19, p20, p21, p22, p23, p24, p25, p26, p27, p28, p29, p30, p31, p32, p33, p34, p35, p36, p37, p38, p39, p40, p41, p42, p43, p44, p45, p46, p47, p48, p49, p50, p51, p52, p53, p54, p55, p56, p57, p58, p59, p60, p61, p62, p63, p64, p65, p66, p67, p68, p69, p70, p71, p72, p73, p74, p75, p76, p77, p78, p79, p80, p81, p82, p83, p84, p85, p86, p87, p88, p89, p90, p91, p92, p93, p94, p95, p96, p97, p98, p99, p100, p101, p102, p103, p104, p105, p106, p107, p114, p115, p116, p117, p118, p119, p120, p121, p122, p123, p124, p125, p126, p127, p128, p129, p130, p131, p132, p133, p134, p135, p136, p137, p138, p139, p140, p141, p142, p143, p144, p145, p146, p147, p148, p149, p150, p151, p152, p153, p154, p155, p156, p157, p158, p159, p160, p161, p162, p175, i171, i166]
+725: guard_isnull(p175, descr=<Guard0x7f7b2d3ac020>) [p0, p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, p12, p13, p14, p15, p16, p17, p18, p19, p20, p21, p22, p23, p24, p25, p26, p27, p28, p29, p30, p31, p32, p33, p34, p35, p36, p37, p38, p39, p40, p41, p42, p43, p44, p45, p46, p47, p48, p49, p50, p51, p52, p53, p54, p55, p56, p57, p58, p59, p60, p61, p62, p63, p64, p65, p66, p67, p68, p69, p70, p71, p72, p73, p74, p75, p76, p77, p78, p79, p80, p81, p82, p83, p84, p85, p86, p87, p88, p89, p90, p91, p92, p93, p94, p95, p96, p97, p98, p99, p100, p101, p102, p103, p104, p105, p106, p107, p114, p115, p116, p117, p118, p119, p120, p121, p122, p123, p124, p125, p126, p127, p128, p129, p130, p131, p132, p133, p134, p135, p136, p137, p138, p139, p140, p141, p142, p143, p144, p145, p146, p147, p148, p149, p150, p151, p152, p153, p154, p155, p156, p157, p158, p159, p160, p161, p162, p175, i171, i166]
Giving the JIT a hint to promote the custom_interface into a constant solves this problem. It means that the lookup_method can then be elided, and the JIT can figure out that it never returns anything other than null, so instead it just retrieves the value from a storage object.
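As a rough sketch of the kind of hint involved: the RPython way to do this is jit.promote together with an elidable lookup. The call site below is illustrative, and the methods dict on Interface is an assumption, not the actual Lever runtime code.

# Sketch only: illustrative call site, assuming a 'methods' dict on Interface.
from rpython.rlib import jit

class Interface(object):
    _immutable_fields_ = ['methods']

    @jit.elidable
    def lookup_method(self, name):
        # Elidable: for a given (interface, name) pair the result never changes,
        # so the JIT may fold the whole call away.
        return self.methods.get(name, None)

def get_method(obj, name):
    interface = jit.promote(obj.custom_interface)  # treat as a constant in the trace
    return interface.lookup_method(name)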
After the improvement the program has shaved down to 2.7 seconds and the paeth filter appears to start becoming a bottleneck again.
100.0% L3_17 100.0% png_load.lc:3 (line 5)
100.0% .. L5_39 100.0% png.lc:5
98.7% .... L53_86 98.7% png.lc:53 (line 231)
0.1% ...... L91_92 0.1% png.lc:91
1.9% ...... L97_98 1.9% png.lc:97
31.8% ...... L100_111 32.2% png.lc:100 (line 192)
0.1% ...... L94_95 0.1% png.lc:94
1.0% .... L134_151 1.0% png.lc:134
0.1% ...... L151_153 14.3% png.lc:151
An insight into the JIT
Finding the above problem perhaps gave me some intuition about the behavior of the metatracing JIT.
When installing a JIT driver, I have to tell it the color of certain variables:
jitdriver = jit.JitDriver(
    greens=['pc', 'block', 'module', 'unit', 'excs', 'function'], #, 'ec'],
    reds=['frame'], # 'regv',
    virtualizables = ['frame'],
    get_printable_location=get_printable_location,
    get_unique_id=get_unique_id, is_recursive=True)
The need to specify red variables is an implementation detail here, but the green variables are needed by the JIT. The greens tell which variables are supposed to stay constant during a trace.
When a trace meets a green variable, it puts a guard on it and treats the program as if that variable remained constant. This information allows the tracing JIT to do deduction and optimize the trace.
Variables that can be deduced to be constant, such as constants read from other constants, can be implicitly marked green without a guard. On the other hand, if a guard fails, the program needs to branch and has to trace again.
In a well-tuned JIT, methods that are not dynamically used are promoted into constants.
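A toy interpreter makes the coloring concrete. This is not the Lever interpreter, just the canonical shape of an RPython jit driver: the greens (pc, code) pin down the position in the user program, and the red (acc) carries the values being computed.

from rpython.rlib import jit

driver = jit.JitDriver(greens=['pc', 'code'], reds=['acc'])

def interpret(code):
    # code is a string of '+' and '-' opcodes; acc is the single red value.
    pc = 0
    acc = 0
    while pc < len(code):
        driver.jit_merge_point(pc=pc, code=code, acc=acc)
        op = code[pc]
        if op == '+':
            acc += 1
        elif op == '-':
            acc -= 1
        pc += 1
    return acc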
Shifting the filter call out of the loop
It turns out that we can still squeeze some juice out by restructuring the program. Let's take this loop inside the reader loop:
for i in range(index, index+L)
    if i < self.stride
        l = 0
        c = 0
    else
        l = scanline[i - self.stride]
        c = prior[i - self.stride]
    u = prior[i]
    scanline[i] = data[i-index] + self.filter(l, u, c)
And replace it with this:
self.filter(prior, scanline, data[.: L], self.stride, index)
Then we write several smaller loops like this:
# average
((prior, scanline, data, stride, offset):
    for i in range(offset, offset+data.length)
        u = prior[i]
        if i >= stride
            u = scanline[i - stride] + u
        scanline[i] = data[i - offset] + u // 2
),
This shaves the runtime down to 1.9 seconds. It means that one image takes about 200ms to load.
100.0% L3_17 100.0% png_load.lc:3 (line 5)
100.0% .. L5_39 100.0% png.lc:5
98.7% .... L53_78 98.7% png.lc:53 (line 231, with 289)
0.8% ...... L85_91 0.9% png.lc:85
4.4% ...... L98_104 4.5% png.lc:98
84.6% ...... L106_126 85.7% png.lc:106 (line 300)
0.8% ...... L93_96 0.9% png.lc:93
0.6% .... L149_166 0.6% png.lc:149
0.2% ...... L166_168 33.3% png.lc:166
The statistics show, perhaps reassuringly, that the paeth filter constitutes 80% of the runtime. If we somehow managed to shave it to a tenth of its time, the program would take only about 550ms to run the benchmark.
How much faster could it run?
Now there's an interesting question: If I optimized just the paeth-predictor loop, how fast could this program run?
If I temporarily disable the paeth loop, the app takes 0.3s to run. So if that part of the program didn't run at all, that would be the limit we cannot exceed by optimizing only that one function in the program.
If we managed to optimize all of the inner loops, the vmprofile would then start to look like this:
100.0% L3_17 100.0% png_load.lc:3 (line 5)
10.0% .. L30_41 10.0% png_load.lc:30
86.0% .. L5_39 86.0% png.lc:5
6.0% .... L154_171 7.0% png.lc:154
76.0% .... L53_78 88.4% png.lc:53 (line 231, with 289)
The assembler
Motivated by the fast paeth filter, I decided to pick up the google/CPU-instructions dataset and implement an assembler that uses those tables.
If you are ever building an assembler remember that it helps a whole lot if you have a machine-readable table that tells the encoding and operands for every instruction.
While writing my own I also noticed that the encoding task is easy if you break it into two stages. In the first stage you fill up the parameters such as:
- rex byte
- operand_op field (for operands that embed into the opcode fields)
- modrm byte
- sib byte
- displacement offset
- immediate values passed in
Doing this kind of scan through the operands makes it easier to determine the size of the displacement field or where the SIB byte belongs.
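As an illustration of the two stages for the very simplest case, a 64-bit register-to-register ADD, here is a small Python sketch; a table-driven encoder naturally also has to handle memory operands, SIB bytes, displacements and immediates.

def encode_add_r64_r64(dst, src):
    # Stage 1: fill up the parameters.
    rex_w = 1                    # 64-bit operand size
    rex_r = (src >> 3) & 1       # extends the ModRM.reg field
    rex_b = (dst >> 3) & 1       # extends the ModRM.rm field
    modrm_mod = 0b11             # register-direct addressing, no SIB or displacement
    modrm_reg = src & 0b111
    modrm_rm  = dst & 0b111
    # Stage 2: serialize the fields into bytes.
    rex   = 0x40 | (rex_w << 3) | (rex_r << 2) | rex_b
    modrm = (modrm_mod << 6) | (modrm_reg << 3) | modrm_rm
    return bytes([rex, 0x01, modrm])   # 0x01 = ADD r/m64, r64

# encode_add_r64_r64(3, 1) gives 48 01 cb, which is "add rbx, rcx".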
The CPU-instructions dataset also contains SSE and AVX instructions, and I expected they would require encoding of the VEX or EVEX prefix. I didn't do that right yet, but I am sure to return to the subject sometime soon. In any case, the proper use of those instructions requires feature detection code.
After looking at the mnemonics for a while, I decided not to use them and instead treat them as a form of documentation that describes what semantics the instruction follows.
Returns and branches that have far and near versions share the same mnemonic, so I found it a bit risky to just let a program select a certain instruction from the list and run with it. To work around this problem, I gave a unique identifier to every instruction in the list:
5931 XCHG AX r16 [64] [L] Exchange r16 with AX.
5932 XCHG EAX r32 [64] [L] Exchange r32 with EAX.
5933 XCHG RAX r64 [64] Exchange r64 with RAX.
5934 XCHG m16 r16 [64] [L] Exchange r16 with word from r/m16.
5935 XCHG m32 r32 [64] [L] Exchange r32 with doubleword from r/m32.
5936 XCHG m64 r64 [64] Exchange r64 with quadword from r/m64.
5937 XCHG m8 r8 [64] [L] Exchange r8 (byte register) with byte from r/m8.
5938 XCHG m8 r8 [64] Exchange r8 (byte register) with byte from r/m8.
5939 XCHG r16 AX [64] [L] Exchange AX with r16.
5940 XCHG r16 m16 [64] [L] Exchange word from r/m16 with r16.
5941 XCHG r16 r16 [64] [L] Exchange word from r/m16 with r16.
5942 XCHG r16 r16 [64] [L] Exchange r16 with word from r/m16.
5943 XCHG r32 EAX [64] [L] Exchange EAX with r32.
5944 XCHG r32 m32 [64] [L] Exchange doubleword from r/m32 with r32.
5945 XCHG r32 r32 [64] [L] Exchange doubleword from r/m32 with r32.
5946 XCHG r32 r32 [64] [L] Exchange r32 with doubleword from r/m32.
5947 XCHG r64 RAX [64] Exchange RAX with r64.
5948 XCHG r64 m64 [64] Exchange quadword from r/m64 with r64.
5949 XCHG r64 r64 [64] Exchange r64 with quadword from r/m64.
5950 XCHG r64 r64 [64] Exchange quadword from r/m64 with r64.
5951 XCHG r8 m8 [64] [L] Exchange byte from r/m8 with r8 (byte register).
5952 XCHG r8 m8 [64] Exchange byte from r/m8 with r8 (byte register).
5953 XCHG r8 r8 [64] [L] Exchange r8 (byte register) with byte from r/m8.
5954 XCHG r8 r8 [64] [L] Exchange byte from r/m8 with r8 (byte register).
5955 XCHG r8 r8 [64] Exchange r8 (byte register) with byte from r/m8.
5956 XCHG r8 r8 [64] Exchange byte from r/m8 with r8 (byte register).
Many of these are the same opcode but with different prefixes or flags. I don't have a method to calculate the size of these, and I do not have a heuristic for selecting them.
To represent addresses in the memory operands, I picked the following format:
Address(type, offset, base=-1, index=-1, scale=1)
You can have the address without the base register if the additional 4 bytes of offset are tolerable.
Now I can finally emit instructions like this:
# 1006 LEA r64 m [64] Store effective address for m in register r64.
emit(1006, Register(i64, 11), Address(i64, 0, 10, 13))
# ABS on R11
# 1127 MOV r64 r64 [64] Move r/m64 to r64.
emit(1127, Register(i64, 13), Register(i64, 11))
# 1289 NEG r64 [64] Two's complement negate r/m64.
emit(1289, Register(i64, 11))
# 335 CMOVL r64 m64 [64] Move if less (SF≠ OF).
emit( 335, Register(i64, 11), Register(i64, 13))
I linked the loops by running the assembler twice and using ordinary variables for them:
emit( 878, Immediate(i8, loop_label - jl1_label))
jl1_label := output.length
As long as the instruction numbers and lengths do not change, the second run reaches a fixed point.
The paeth decoder written in assembly like the above looks like this:
assert platform.arch == "x86_64"
"It just might not work on x86"
assert platform.name.startswith("linux")
"Also this depends on the SystemV calling conventions."
arg_0 = Register(i64, 7) # RDI
arg_1 = Register(i64, 6) # RSI
arg_2 = Register(i64, 2) # RDX
arg_3 = Register(i64, 1) # RCX
arg_4 = Register(i64, 8)
arg_5 = Register(i64, 9)
loop_label = 0
exit_label = 0
ja_label = 0
jl1_label = 0
jl2_label = 0
assemble = ():
# 1811 PUSH r64 [64] Push r64.
# we need additional register to clobber.
emit(1811, Register(i64, 3)) # x .r3
emit(1811, Register(i64, 10)) # .r10
emit(1811, Register(i64, 11)) # .r11
emit(1811, Register(i64, 12)) # .r12
emit(1811, Register(i64, 13)) # a .r13
emit(1811, Register(i64, 14)) # b .r14
emit(1811, Register(i64, 15)) # c .r15
# i = 0 .r0
# 5973 XOR m64 r64 [64] r/m64 XOR r64.
emit(5973, Register(i64, 0), Register(i64, 0))
emit(5973, Register(i64, 13), Register(i64, 13))
emit(5973, Register(i64, 15), Register(i64, 15))
# j = offset .arg_4(8)
# 1289 NEG r64 [64] Two's complement negate r/m64.
# 74 ADD r64 r64 [64] Add r64 to r/m64.
# k = j - stride .r1
emit(1289, Register(i64, 1))
emit( 74, Register(i64, 1), Register(i64, 8))
# 878 JL rel8 [64] [L] Jump short if less (SF≠ OF).
emit( 878, Immediate(i8, loop_label - jl1_label))
jl1_label := output.length
emit(1254, Register(i64, 13), Address(i8, 0, 7, 1)) # [prior .r7 + k .r1]
emit(1254, Register(i64, 15), Address(i8, 0, 6, 1)) # [scanline .r6 + k .r1]
# loop:
loop_label := output.length
# 0475 CMP m64 r64 [64] Compare r64 with r/m64.
# if i >= length goto exit
emit( 475, Register(i64, 0), arg_5)
# 0862 JAE rel32 [64] [L] Jump short if above or equal (CF=0).
emit( 862, Immediate(i32, exit_label - ja_label))
ja_label := output.length
## Test
# emit( 54, Address(i64, 0, arg_0.index), Immediate(i32, 1))
# 1254 MOVZX r64 m8 [64] Move byte to quadword, zero-extension.
# x = input[i]
emit(1254, Register(i64, 14), Address(i8, 0, 7, 8)) # [prior .r7 + j .r8]
# Starting paeth predictor here. a.r13, b.r14, c.r15
# 1006 LEA r64 m [64] Store effective address for m in register r64.
# 2327 SUB r64 m64 [64] Subtract r/m64 from r64.
#p = a + b - c p.r10
emit(1006, Register(i64, 10), Address(i64, 0, 13, 14))
emit(2327, Register(i64, 10), Register(i64, 15))
emit(1289, Register(i64, 10)) # negate to use LEA again.
# 335 CMOVL r64 m64 [64] Move if less (SF≠ OF).
# 1127 MOV r64 r64 [64] Move r/m64 to r64.
#pa = abs(a - p) p.r10 a.r13 pa.r11
emit(1127, Register(i64, 3), Register(i64, 13))
emit(1006, Register(i64, 11), Address(i64, 0, 10, 13)) # The a.r13 is free now.
# ABS on R11
emit(1127, Register(i64, 13), Register(i64, 11))
emit(1289, Register(i64, 11)) # negate to use CMOVL
emit( 335, Register(i64, 11), Register(i64, 13))
#pb = abs(b - p) p.r10 a.r13 pa.r12
emit(1006, Register(i64, 12), Address(i64, 0, 10, 14))
# ABS on R12
emit(1127, Register(i64, 13), Register(i64, 12))
emit(1289, Register(i64, 12)) # negate to use CMOVL
emit( 335, Register(i64, 12), Register(i64, 13))
# Now we want to compare and CMOVL if relevant.
emit( 475, Register(i64, 12), Register(i64, 11))
emit( 335, Register(i64, 11), Register(i64, 12))
emit( 335, Register(i64, 3), Register(i64, 14))
#pc = abs(c - p) p.r10 a.r13 pa.r11
emit(1006, Register(i64, 12), Address(i64, 0, 10, 15))
# ABS on R12
emit(1127, Register(i64, 13), Register(i64, 12))
emit(1289, Register(i64, 12)) # negate to use CMOVL
emit( 335, Register(i64, 12), Register(i64, 13))
# Now we want to compare and CMOVL if relevant.
emit( 475, Register(i64, 12), Register(i64, 11))
emit( 335, Register(i64, 11), Register(i64, 12))
emit( 335, Register(i64, 3), Register(i64, 15))
# Done with paeth, next we add.
#emit(1254, Register(i64, 3), Address(i8, 0, 2, 0)) # [input .r2 + i .r0]
# 78 ADD r8 m8 [64] [L] Add r/m8 to r8.
emit( 78, Register(i8, 3), Address(i8, 0, 2, 0)) # [input .r2 + i .r0]
# 1097 MOV m8 r8 [64] [L] Move r8 to r/m8.
# scanline[j] = x
emit(1097, Address(i8, 0, 6, 8), Register(i8, 3)) # [scanline .r6 + j .r8]
# 0072 ADD r64 imm8 [64] Add sign-extended imm8 to r/m64.
# i += 1
emit( 72, Register(i64, 0), Immediate(i8, 1))
# j += 1
emit( 72, Register(i64, 8), Immediate(i8, 1))
# clean a and c
emit(5973, Register(i64, 13), Register(i64, 13))
emit(5973, Register(i64, 15), Register(i64, 15))
# k += 1
emit( 72, Register(i64, 1), Immediate(i8, 1))
# 878 JL rel8 [64] [L] Jump short if less (SF≠ OF).
emit( 878, Immediate(i8, loop_label - jl2_label))
jl2_label := output.length
emit(1254, Register(i64, 13), Address(i8, 0, 7, 1)) # [prior .r7 + k .r1]
emit(1254, Register(i64, 15), Address(i8, 0, 6, 1)) # [scanline .r6 + k .r1]
# jump loop
# 0886 JMP rel32 [64] [L] Jump short, RIP = RIP + 8-bit displacement sign extended to 64-bits
emit( 886, Immediate(i32, loop_label - exit_label))
# exit:
exit_label := output.length
# Restore .r3 we clobbered.
# 1638 POP r64 [64] Pop top of stack into r64; increment stack pointer. Cannot encode 32-bit operand size.
emit(1638, Register(i64, 15))
emit(1638, Register(i64, 14))
emit(1638, Register(i64, 13))
emit(1638, Register(i64, 12))
emit(1638, Register(i64, 11))
emit(1638, Register(i64, 10))
emit(1638, Register(i64, 3))
# return
emit(1901) # RET NEAR
emit = (uid, args...):
output.extend(asm.encode_ins(uid, args))
output = []
assemble()
output = []
assemble()
At the point where I was going to write the actual algorithm that the loop was supposed to run, I noticed that the available registers had almost run out.
From here it's easy to run this in place of the original paeth filter. We copy the assembler output into an executable memory region and W^X it. I already have a utility for that:
buf = mman.Asmbuf(4096)
buf.memcpy(Uint8Array(output))
buf.finalize()
Next we need a function handle that we can call. We get one from the FFI:
c_type = ffi.cfunc(ffi.int, [
    ffi.voidp, ffi.voidp, ffi.voidp,      # prior, scanline, data
    ffi.size_t, ffi.size_t, ffi.size_t])  # stride, offset, length
c_func = ffi.cast(buf, c_type)
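For readers without the Lever utilities at hand, the same two steps (executable memory plus a function pointer) can be sketched in plain Python with mmap and ctypes. This sketch maps the page read-write-execute for brevity rather than doing a strict W^X flip, and assumes Linux on x86-64.

import ctypes, mmap

code = bytes([0x89, 0xf8, 0xc3])   # mov eax, edi ; ret  -- returns its first argument
buf = mmap.mmap(-1, mmap.PAGESIZE,
                prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC)
buf.write(code)
addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
func = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_int)(addr)
assert func(42) == 42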
The filter array requires a little rewrite so that the length field is exposed like above, but otherwise this already works, and we can hot-patch the assembler program into the png module:
import png
png.decode_filters[4] = c_func
main_png()
Now we can run the original benchmark.
Assembler results
With the monkey-patched assembly module, our original benchmark now takes 0.6 seconds to run, and the vmprofile report very closely resembles the one where the paeth decoder was removed:
100.0% L5_200 100.0% fast_paeth.lc:5
100.0% .. L203_217 100.0% fast_paeth.lc:203
6.6% .... L230_241 6.6% fast_paeth.lc:230
92.3% .... L5_39 92.3% png.lc:5
11.0% ...... L149_166 11.9% png.lc:149
3.3% ........ L166_168 30.0% png.lc:166
79.1% ...... L53_78 85.7% png.lc:53 (line 231, with 289)
1.1% ........ L93_96 1.4% png.lc:93
24.2% ........ L98_104 30.6% png.lc:98
4.4% ........ L85_91 5.6% png.lc:85
I did double check that the paeth decoder keeps working. I made a hexadecimal dump from the image and compared the data with the optimized version. Both of them look like this:
35 51 45 ff 36 46 3b ff 1b 1e 19 ff 18 1b 16 ff
33 4f 43 ff 27 3a 2f ff 1b 1e 19 ff 18 1b 16 ff
2c 4a 40 ff 2e 4c 42 ff 2f 4d 43 ff 34 4f 46 ff
27 45 3b ff 29 47 3d ff 2a 48 3e ff 28 48 3a ff
27 47 3c ff 26 44 3a ff 25 43 39 ff 23 43 39 ff
You can tell that this data is at least plausibly correct, because the full alpha channels separate the scanlines into columns of ff symbols.
Closer inspection shows that the assembled version has some errors. It still runs through all the pixels, so I'd believe the problem could be something subtle.
I won't use the assembled version quite yet, so I leave the fix for later iterations.
Finally I also made a Vulkan demo that uses the JIT-optimized png module. Here's the screenshot:
Hand-written assembly is not practical
The assembler optimization is very successful, considering that we don't even have to store any compiled executables here and we are still 4 times faster than running within the JIT.
The problem is that the assembly module here cannot adjust for different calling conventions. It only works on Linux and on x86_64. There are other problems:
- We have two versions to maintain. And if we discard the old version, then we cannot return to JIT-evaluating the code in case it is more practical for some possible use case.
- We had to rewrite the assembler version from scratch when trying to optimize the program that way.
- We cannot access all GC records from the assembly. Only the ones that we could naturally pass into C libraries.
- The produced assembly code is calling-convention dependent and fixed to an operating system.
The practical solution would be to run type inference on the original program and then compile it down to assembly code. That is the reason why I wrote the assembler in the first place, but it is also a subject for a second blog post.
Driving it through a GLSL filter in Vulkan
Finally I made a Vulkan application that uses the PNG loader. The full source code can be found in samples/warpgpu/2_images.lc.
After the image has been loaded, we upload it to the GPU by memcpying it into a mapped location like this:
image.mem = gpu.mem.associate(image, [
    "HOST_VISIBLE_BIT", "HOST_COHERENT_BIT"
])
data = image.mem.map(ffi.byte, image.offset, image.size)
ffi.memcpy(data, png_image.data, png_image.data.length)
Of course, this is a bit wasteful, because the PNG loader wouldn't really need anything other than the two latest scanlines. It could write into the scanline and the image simultaneously to load the image without ever holding the fully decompressed image in memory.
It was nice to start with the simple use case first though.
Finally I checked the image in the GLSL output. It produces the effect by shifting the texture coordinates horizontally using output from a Voronoi noise generator:
Extra: Let's try SSE & SIMD instructions!
At first I thought I wouldn't go this far, but I still had Sunday to go and everyone reading this post would expect me to get into SIMD. So let's go!
Feature detection is mandatory if you intend to use SIMD, but the first thing was to check what my CPU actually supports. So I ran cat /proc/cpuinfo:
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz
stepping : 7
microcode : 0x70a
cpu MHz : 2003.000
cache size : 6144 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm tpr_shadow vnmi flexpriority dtherm
bugs :
bogomips : 5666.40
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
It seems that I can use all the instructions up to SSE4_1. Unfortunately the use of SSE doesn't require the VEX prefix, so I don't have a way to test whether I would get the VEX encoding right.
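A quick and dirty way to do the same check programmatically on Linux is to parse the flags line of /proc/cpuinfo; a real runtime would use the CPUID instruction instead, and this little helper is hypothetical, not part of the Lever runtime.

def cpu_flags():
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
print("sse4_1" in flags)   # True on this Core 2 Quad
print("avx" in flags)      # False, so no VEX-encoded instructions to test with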
It's no wonder that stereotypical geeks are charismatic white males with fancy beards and no girlfriends! You would need very big wads of cash to stay updated on all the coolest hardware that pushes out of the factories at a constant pace.
A quick look into the instruction table tells me that I have 565 SIMD operations available. When I discard the operations that work on floating point values, I end up with 361 operations to choose from.
Vectorizing the program
The first step is to look at the original program and see what we could do for it. To make this easier, we break it into extended basic blocks:
t = a + b
p = t - c
ta = p - a
pa = abs ta
tb = p - b
pb = abs tb
tc = p - c
pc = abs tc
c1 = le pa pb
c2 = le pa pc
c3 = and c1 c2
return a if c3
c4 = le pb pc
return b if c4
return c
It is a surprisingly large number of operations after all. What we are about to do is called auto-vectorization in compilers. I guess what we are specifically about to do is called loop vectorization.
One important thing to note is that the algorithm has not been defined on bytes. This means that we might get the wrong result if we keep the bytes packed in our registers. We may have to unpack, so we are going to need instructions for that. We are also going to need instructions to load values into the XMM registers. I think the following is sufficient:
PACKSSDW PACKSSDW PACKSSDW PACKSSDW PACKSSWB PACKSSWB PACKSSWB PACKSSWB PACKUSWB PACKUSWB PACKUSWB PACKUSWB
PUNPCKHBW PUNPCKHBW PUNPCKHBW PUNPCKHBW PUNPCKHDQ PUNPCKHDQ PUNPCKHDQ
PUNPCKHDQ PUNPCKHQDQ PUNPCKHQDQ PUNPCKHWD PUNPCKHWD PUNPCKHWD PUNPCKHWD
PUNPCKLBW PUNPCKLBW PUNPCKLBW PUNPCKLBW PUNPCKLDQ PUNPCKLDQ PUNPCKLDQ
PUNPCKLDQ PUNPCKLQDQ PUNPCKLQDQ PUNPCKLWD PUNPCKLWD PUNPCKLWD PUNPCKLWD
MOVDQA MOVDQU
Second, I looked up what I could find in our instruction table of valid operations. I found the following:
t = a + b PADDB, PADDD, PADDQ, PADDSB, PADDSW, PADDUSB, PADDUSW, PADDW, PADDW, PALIGNR [m64, m128]
p = t - c PSUBB, PSUBD, PSUBQ, PSUBSB, PSUBSW, PSUBUSB, PSUBUSW, PSUBW
ta = p - a
pa = abs ta PABSB, PABSD PABSW [m64, m128]
tb = p - b
pb = abs tb
tc = p - c
pc = abs tc
c1 = le pa pb PCMPGTB, PCMPGTW, PCMPGTD [m64, m128]
c2 = le pa pc
c3 = and c1 c2 PAND, PANDN [m64, m128]
return a if c3
c4 = le pb pc
return b if c4
return c
It appears that we have m128 variants for everything, so we can use the SSE2 instructions everywhere here. We would also have PHADDW to add packed short integers horizontally.
The SIMD instructions do not have conditional operations, so we have to use a trick instead (the mask trick is sketched right after this listing).
return a if c3 PAND, PANDN, POR, PXOR [m64, m128]
c4 = le pb pc
return b if c4
return c
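The trick is the classic branchless select: build an all-ones or all-zeros mask with the compare, then combine (a AND mask) OR (b AND NOT mask). A small sketch on plain integers shows the idea that the PAND/PANDN/POR sequence implements lane-wise on the XMM registers.

MASK32 = 0xFFFFFFFF

def select_by_mask(mask, a, b):
    # mask is all ones where the condition holds, all zeros elsewhere
    return (a & mask) | (b & (~mask & MASK32))

assert select_by_mask(MASK32, 0x11111111, 0x22222222) == 0x11111111
assert select_by_mask(0x00000000, 0x11111111, 0x22222222) == 0x22222222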
Now we have everything we need to carry this out. The final program:
result = 1168 MOVDQU [input + i - offset]
a = 1164 MOVDQU [scanline + i - stride]
b = 1168 MOVDQU [prior + i]
c = 1164 MOVDQU [prior + i - stride]
al = 1786 PUNPCKLBW a
bl = 1786 PUNPCKLBW b
cl = 1786 PUNPCKLBW c
ah = 1772 PUNPCKHBW a
bh = 1772 PUNPCKHBW b
ch = 1772 PUNPCKHBW c
pal = 1165 MOVDQA al
pah = 1165 MOVDQA ah
pal += 1395 PADDD bl
pah += 1395 PADDD bh
pal -= 1742 PSUBD al
pah -= 1742 PSUBD ah
pbl = 1165 MOVDQA pal
pbh = 1165 MOVDQA pah
pcl = 1165 MOVDQA pal
pch = 1165 MOVDQA pah
pal -= 1742 PSUBD al
pbl -= 1742 PSUBD bl
pcl -= 1742 PSUBD cl
pah -= 1742 PSUBD ah
pbh -= 1742 PSUBD bh
pch -= 1742 PSUBD ch
pal = 1369 PABSD pal
pbl = 1369 PABSD pbl
pcl = 1369 PABSD pcl
pah = 1369 PABSD pah
pbh = 1369 PABSD pbh
pch = 1369 PABSD pch
c1l = 1472 PCMPGTD pbl pal
c2l = 1472 PCMPGTD pcl pal
c4l = 1472 PCMPGTD pcl pbl
c1h = 1472 PCMPGTD pbh pah
c2h = 1472 PCMPGTD pch pah
c4h = 1472 PCMPGTD pch pbh
c3l = 1427 PAND c1l c2l
c3h = 1427 PAND c1h c2h
bl = 1427 PAND bl c4l
bh = 1427 PAND bh c4h
cl = 1431 PANDN cl, c4l
ch = 1431 PANDN ch, c4h
bl = 1650 POR bl, cl
bh = 1650 POR bh, ch
al = 1427 PAND al, c3l
ah = 1427 PAND ah, c3h
bl = 1431 PANDN bl, c3l
bh = 1431 PANDN bh, c3h
al = 1650 POR al, bl
ah = 1650 POR ah, bh
delta = 1387 PACKUSWB al, ah
result += 1391 PADDB delta
1167 MOVDQU [scanline], result
It's ridiculous! 54 operations that run on 16 bytes of data at once. Next we should allocate registers and produce the program. On Linux we are free to use the whole range of XMM registers without having to save them, so we do so.
simd_paeth_assembly = (emit, input_addr, a_addr, b_addr, c_addr, scanline_addr):
xmm0 = Register(m128, 0)
xmm1 = Register(m128, 1)
xmm2 = Register(m128, 2)
xmm3 = Register(m128, 3)
xmm4 = Register(m128, 4)
xmm5 = Register(m128, 5)
xmm6 = Register(m128, 6)
xmm7 = Register(m128, 7)
xmm8 = Register(m128, 8)
xmm9 = Register(m128, 9)
xmm10 = Register(m128, 10)
xmm11 = Register(m128, 11)
xmm12 = Register(m128, 12)
xmm13 = Register(m128, 13)
xmm14 = Register(m128, 14)
xmm15 = Register(m128, 15)
# register allocation
result = xmm0
al = xmm1
bl = xmm2
cl = xmm3
ah = xmm4
bh = xmm5
ch = xmm6
pal = xmm7
pbl = xmm8
pcl = xmm9
pah = xmm10
pbh = xmm11
pch = xmm12
# result = 1164 MOVDQU [input + i - offset]
emit(1168, result, input_addr)
# a = 1168 MOVDQU [scanline + i - stride]
emit(1164, al, a_addr)
# b = 1168 MOVDQU [prior + i]
emit(1168, bl, b_addr)
# c = 1168 MOVDQU [prior + i - stride]
emit(1164, cl, c_addr)
# ah = 1772 PUNPCKHBW a
# bh = 1772 PUNPCKHBW b
# ch = 1772 PUNPCKHBW c
emit(1772, ah, al)
emit(1772, bh, bl)
emit(1772, ch, cl)
# al = 1786 PUNPCKLBW a
# bl = 1786 PUNPCKLBW b
# cl = 1786 PUNPCKLBW c
emit(1786, al, al)
emit(1786, bl, bl)
emit(1786, cl, cl)
# pal = 1165 MOVDQA al
# pah = 1165 MOVDQA ah
emit(1165, pal, al)
emit(1165, pah, ah)
# pal += 1395 PADDD bl
# pah += 1395 PADDD bh
emit(1395, pal, bl)
emit(1395, pah, bh)
# pal -= 1742 PSUBD al
# pah -= 1742 PSUBD ah
emit(1742, pal, al)
emit(1742, pah, ah)
# pbl = 1165 MOVDQA pal
# pbh = 1165 MOVDQA pah
# pcl = 1165 MOVDQA pal
# pch = 1165 MOVDQA pah
emit(1165, pbl, pal)
emit(1165, pbh, pah)
emit(1165, pcl, pcl)
emit(1165, pch, pch)
# pal -= 1742 PSUBD al
# pbl -= 1742 PSUBD bl
# pcl -= 1742 PSUBD cl
# pah -= 1742 PSUBD ah
# pbh -= 1742 PSUBD bh
# pch -= 1742 PSUBD ch
emit(1742, pal, al)
emit(1742, pbl, bl)
emit(1742, pcl, cl)
emit(1742, pah, ah)
emit(1742, pbh, bh)
emit(1742, pch, ch)
# pal = 1369 PABSD pal
# pbl = 1369 PABSD pbl
# pcl = 1369 PABSD pcl
# pah = 1369 PABSD pah
# pbh = 1369 PABSD pbh
# pch = 1369 PABSD pch
emit(1369, pal, pal)
emit(1369, pbl, pbl)
emit(1369, pcl, pcl)
emit(1369, pah, pah)
emit(1369, pbh, pbh)
emit(1369, pch, pch)
# We would need at least 4 registers more, so
# we need to figure out which ones we can reuse.
c1l = xmm13
c1h = xmm14
c2l = xmm15
# If we do some reordering here, we can
# reuse pal
# c1l = 1472 PCMPGTD pbl pal
# c1h = 1472 PCMPGTD pbh pah
# c2l = 1472 PCMPGTD pcl pal
# c2h = 1472 PCMPGTD pch pah
emit(1164, c1l, pbl)
emit(1472, c1l, pal)
emit(1164, c2l, pcl)
emit(1472, c2l, pal)
c2h = pal
emit(1164, c2h, pch)
emit(1472, c2h, pah)
emit(1164, c1h, pbh)
emit(1472, c1h, pah)
# c4l = 1472 PCMPGTD pcl pbl
# c4h = 1472 PCMPGTD pch pbh
# We can reuse pcl/pch for c4l, c4h
c4l = pcl # emit(1164, c4l, pcl)
c4h = pch # emit(1164, c4h, pch)
emit(1472, c4l, pbl)
emit(1472, c4h, pbh)
# c3l = 1427 PAND c1l c2l
# c3h = 1427 PAND c1h c2h
# We can reuse c1* for c3*
c3l = c1l
c3h = c1h
emit(1427, c3l, c2l)
emit(1427, c3h, c2h)
# bl = 1427 PAND bl c4l
# bh = 1427 PAND bh c4h
emit(1427, bl, c4l)
emit(1427, bh, c4h)
# cl = 1431 PANDN cl, c4l
# ch = 1431 PANDN ch, c4h
emit(1431, cl, c4l)
emit(1431, ch, c4h)
# bl = 1650 POR bl, cl
# bh = 1650 POR bh, ch
emit(1650, bl, cl)
emit(1650, bh, ch)
# al = 1427 PAND al, c3l
# ah = 1427 PAND ah, c3h
emit(1427, al, c3l)
emit(1427, ah, c3h)
# bl = 1431 PANDN bl, c3l
# bh = 1431 PANDN bh, c3h
emit(1431, bl, c3l)
emit(1431, bh, c3h)
# al = 1650 POR al, bl
# ah = 1650 POR ah, bh
emit(1650, al, bl)
emit(1650, ah, bh)
# delta = 1387 PACKUSWB al, ah
delta = al
emit(1387, delta, ah)
# result += 1391 PADDB delta
emit(1391, result, delta)
# 1168 MOVDQU [scanline + i], result
emit(1167, scanline_addr, result)
return result
The dump of the above program is 288 bytes long. A quick look with ndisasm revealed that the assembler works. Now we have to install the above thing into a loop to have it run. We can reuse most of our old loop:
assemble = ():
# 1811 PUSH r64 [64] Push r64.
# we no longer clobber any registers we should save.
# i = 0 .r0
# 5973 XOR m64 r64 [64] r/m64 XOR r64.
emit(5973, Register(i64, 0), Register(i64, 0))
# j = offset .arg_4(8)
# 1289 NEG r64 [64] Two's complement negate r/m64.
# 74 ADD r64 r64 [64] Add r64 to r/m64.
# k = j - stride .r1
emit(1289, Register(i64, 1))
emit( 74, Register(i64, 1), Register(i64, 8))
# loop:
loop_label := output.length
# 0475 CMP m64 r64 [64] Compare r64 with r/m64.
# if i >= length goto exit
emit( 475, Register(i64, 0), arg_5)
# 0862 JAE rel32 [64] [L] Jump short if above or equal (CF=0).
emit( 862, Immediate(i32, exit_label - ja_label))
ja_label := output.length
simd_paeth_assembly(emit,
Address(m128, 0, 2, 0), # input = [input .r2 + i .r0]
Address(m128, 0, 6, 1), # a = [scanline .r6 + k .r1]
Address(m128, 0, 7, 8), # b = [prior .r7 + j .r8]
Address(m128, 0, 7, 1), # c = [prior .r7 + k .r1]
Address(m128, 0, 6, 8)) # output = [scanline .r6 + j .r8]
# i, j, k += 16
emit( 72, Register(i64, 0), Immediate(i8, 16))
emit( 72, Register(i64, 8), Immediate(i8, 16))
emit( 72, Register(i64, 1), Immediate(i8, 16))
# jump loop
# 0886 JMP rel32 [64] [L] Jump short, RIP = RIP + 8-bit displacement sign extended to 64-bits
emit( 886, Immediate(i32, loop_label - exit_label))
# exit:
exit_label := output.length
# return
emit(1901) # RET NEAR
Now, unlike the original assembly program, we can't use this new implementation in isolation, because it churns through 16-byte chunks in one swoop. Therefore we need the JIT version to handle the beginning and the end, so these two have to be combined:
make_new_filter = (optimized, ordinary):
    return (prior, scanline, input, stride, offset, length):
        # max(0, offset - stride) + 16 must be filled.
        fast_offset = max(max(0, offset-stride) + 16, offset)
        fast_index = fast_offset - offset
        remain_length = length - fast_index
        last_length = remain_length & 16
        fast_length = remain_length - last_length
        last_index = length - last_length
        last_offset = offset + last_index
        ordinary(prior, scanline, input,
            stride, offset, fast_index)
        optimized(prior, scanline, input[fast_index .:],
            stride, fast_offset, fast_length)
        ordinary(prior, scanline, input[last_index .:],
            stride, last_offset, last_length)
That was already a bit tricky to write.
Debugging
When we run the code, we notice it produces wrong output. Here are the first two scanlines of the JIT-only version:
7f 83 82 ff 7c 80 7f ff 7f 83 82 ff 7b 7f 7e ff ... 48 52 2f ff 4f 59 36 ff 50 5a 37 ff 4e 58 35 ff
7e 82 81 ff 7b 7f 7e ff 7e 82 81 ff 7a 7e 7d ff ... 50 5a 37 ff 57 61 3e ff 57 61 3e ff 52 5c 39 ff
And here's our JIT+SIMD version:
7f 83 82 ff 7c 80 7f ff 7f 83 82 ff 7b 7f 7e ff ... 48 52 2f ff 4f 59 36 ff 50 5a 37 ff 4e 58 35 ff
7e 82 81 ff 7b 7f 7e ff 7e 82 81 ff 7a 7e 7d ff ... 07 07 07 00 0e 0e 0e 00 0e 0e 0e 00 09 09 09 00
It looks a lot like we have something wrong in our paeth. The first line looks okay, but actually a different filter runs there; our filter runs on the second row. Fortunately we can place a software breakpoint:
# 852 INT 3 [64] [L] Interrupt 3-trap to debugger.
emit( 852, null)
But maybe I am being hasty here, because my display shows only 16 bytes from the beginning and the end! This means that on the left side our optimized program hasn't been running at all.
So our filter is supposed to fill the first 16 bytes so that the actual program can start there, like this:
pre-fill: AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA
scanline+k: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
scanline+j: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
But then we actually start the optimized routine like this:
pre-fill: AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA
scanline+k: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
scanline+j: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Oops. Here's the fixed version.
make_new_filter = (optimized, ordinary):
    return (prior, scanline, input, stride, offset, length):
        if length < 32 # if it's such a small chunk that our thing breaks
            return ordinary(prior, scanline, input, stride, offset, length)
        # max(0, offset - stride) + 16 must be filled.
        fill_offset = max(max(0, offset-stride) + 16, offset)
        fast_offset = max(offset, stride)
        fast_index = fast_offset - offset
        remain_length = length - fast_index
        last_length = remain_length & 15
        fast_length = remain_length - last_length
        last_index = length - last_length
        last_offset = offset + last_index
        ordinary(prior, scanline, input,
            stride, offset, fill_offset - offset)
        optimized(prior, scanline, input[fast_index .:],
            stride, fast_offset, fast_length)
        ordinary(prior, scanline, input[last_index .:],
            stride, last_offset, last_length)
It still gives wrong results, but they are more apparent now. Next we use that trap instruction I showed, and run the following in GDB:
display /x $xmm0.uint128
display /x $xmm1.uint128
display /x $xmm2.uint128
display /x $xmm3.uint128
display /x $xmm4.uint128
display /x $xmm5.uint128
display /x $xmm6.uint128
display /x $xmm7.uint128
display /x $xmm8.uint128
display /x $xmm8.uint128
display /x $xmm9.uint128
display /x $xmm10.uint128
display /x $xmm11.uint128
display /x $xmm12.uint128
display /x $xmm13.uint128
display /x $xmm14.uint128
display /x $xmm15.uint128
layout asm
After the MOVDQU instructions, xmm0-xmm3 look like this:
1: /x $xmm0.uint128 = 0x00fefefe00ffffff00ffffff00ffffff
2: /x $xmm1.uint128 = 0xff7d7e7aff81827eff7e7f7bff81827e
3: /x $xmm2.uint128 = 0xff7d7e7aff7e7f7bff82837fff7f807c
4: /x $xmm3.uint128 = 0xff7e7f7bff82837fff7f807cff82837f
The xmm0 looks very familiar. I think it'll be apparent very soon why this code produces the wrong results.
After the punpck{lbw,hbw} have run, the register dump looks like the following:
1: /x $xmm0.uint128 = 0x00fefefe00ffffff00ffffff00ffffff
2: /x $xmm1.uint128 = 0xffff7e7e7f7f7b7bffff818182827e7e
3: /x $xmm2.uint128 = 0xffff828283837f7fffff7f7f80807c7c
4: /x $xmm3.uint128 = 0xffff7f7f80807c7cffff828283837f7f
5: /x $xmm4.uint128 = 0xff007d007e007a00ff00810082007e00
6: /x $xmm5.uint128 = 0xff007d007e007a00ff007e007f007b00
7: /x $xmm6.uint128 = 0xff007e007f007b00ff00820083007f00
It is suddenly painfully apparent that the destination and source register in the unpack operation cannot be the same. So we have to fix those lines like this:
# ah = 1817 PXOR ah
# bh = 1817 PXOR bh
# ch = 1817 PXOR ch
# al = 1817 PXOR al
# bl = 1817 PXOR bl
# cl = 1817 PXOR cl
emit(1817, al, al)
emit(1817, bl, bl)
emit(1817, cl, cl)
emit(1817, ah, ah)
emit(1817, bh, bh)
emit(1817, ch, ch)
# ah = 1772 PUNPCKHBW a
# bh = 1772 PUNPCKHBW b
# ch = 1772 PUNPCKHBW c
emit(1772, ah, pal)
emit(1772, bh, pbl)
emit(1772, ch, pcl)
# al = 1786 PUNPCKLBW a
# bl = 1786 PUNPCKLBW b
# cl = 1786 PUNPCKLBW c
emit(1786, al, pal)
emit(1786, bl, pbl)
emit(1786, cl, pcl)
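To make sure I understood the unpack, here is the mistake and the fix simulated in plain Python. This assumes the usual SSE2 behaviour for PUNPCKLBW, where the result interleaves the low eight bytes of the destination with the low eight bytes of the source; the helper name is mine:
def punpcklbw(dst, src):
    out = []
    for d, s in zip(dst[:8], src[:8]):
        out += [d, s]                     # destination byte first, then source byte
    return out

data = [0x7e, 0x82, 0x81, 0xff, 0x7b, 0x7f, 0x7e, 0xff]
zero = [0] * 8

# Destination and source the same: every byte is duplicated,
# which is exactly what the earlier register dump showed.
print([hex(b) for b in punpcklbw(data, data)])
# Destination zeroed with PXOR first: each byte gets its own 16-bit lane,
# but it lands in the high half of the word because of the interleave order.
print([hex(b) for b in punpcklbw(zero, data)])
Note where the byte lands within its word; that detail comes up again further down.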
Another iteration of the first entry with a debugger reveals that we calculate a + b - a rather than a + b - c like we should. This fixes that:
# pal -= 1742 PSUBD cl
# pah -= 1742 PSUBD ch
emit(1742, pal, cl)
emit(1742, pah, ch)
It seems that at the end we end up adding zero, so this program is actually doing nothing. The only reasonable explanation is that we mess up the conditionals somehow.
After the end of the conditional block, the result looks like this:
1: /x $xmm0.uint128 = 0x00fefefe00ffffff00ffffff00ffffff (result)
2: /x $xmm1.uint128 = 0xff007e007f007b00ff00810082007e00 (al)
3: /x $xmm2.uint128 = 0xff00820083007f00ff007f0080007c00 (bl)
4: /x $xmm3.uint128 = 0xff007f0080007c00ff00820083007f00 (cl)
5: /x $xmm4.uint128 = 0xff007d007e007a00ff00810082007e00 (ah)
6: /x $xmm5.uint128 = 0xff007d007e007a00ff007e007f007b00 (bh)
7: /x $xmm6.uint128 = 0xff007e007f007b00ff00820083007f00 (ch)
8: /x $xmm7.uint128 = 0xffffffffffffffffffffffffffffffff (c2h)
10: /x $xmm8.uint128 = 0x00000100010001000000010001000100 (pbl)
11: /x $xmm9.uint128 = 0xffffffffffffffffffffffffffffffff (c4l)
12: /x $xmm10.uint128 = 0x00000100010001000000040004000400 (pah)
13: /x $xmm11.uint128 = 0x00000100010001000000010001000100 (pbh)
14: /x $xmm12.uint128 = 0xffffffffffffffffffffffffffffffff (c4h)
15: /x $xmm13.uint128 = 0x00000000000000000000000000000000 (c3l)
16: /x $xmm14.uint128 = 0x00000000000000000000000000000000 (c3h)
17: /x $xmm15.uint128 = 0xffffffffffffffffffffffffffffffff (c2l)
The labels reveal that we have the following state:
al, bl, cl, ah, bh, ch
c3 = false
c4 = true
This should mean that the result b is selected. The following instructions should correctly preserve the result.
# bl = 1427 PAND bl c4l
# bh = 1427 PAND bh c4h
emit(1427, bl, c4l)
emit(1427, bh, c4h)
They did. Correspondingly, these next ones should zero out cl and ch.
# cl = 1431 PANDN cl, c4l
# ch = 1431 PANDN ch, c4h
emit(1431, cl, c4l)
emit(1431, ch, c4h)
I think I messed up something again. Reading up on this reveals that PANDN inverts the destination and then does the AND. It's not exactly what we expected. We have to flip the operands over and merge with c3 and c4.
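Since the naming keeps tripping me up, here is the behaviour simulated in plain Python, assuming the SSE2 semantics where PANDN inverts its destination operand before the AND; the constants and names are mine:
MASK128 = (1 << 128) - 1

def pand(dst, src):  return dst & src
def pandn(dst, src): return (~dst & MASK128) & src   # inverts the destination

b, c = 0x2222, 0x3333
for cond in (0, MASK128):                # all-zeros selects c, all-ones selects b
    want = b if cond else c
    # The select we want is (b & cond) | (c & ~cond), so the condition mask
    # has to sit in PANDN's destination, not in its source.
    got = pand(b, cond) | pandn(cond, c)
    assert got == want
In the actual machine code this also means PANDN overwrites the register holding the mask, so a copy is needed if the mask gets used again.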
The result is still completely wrong:
7f 83 82 ff 7c 80 7f ff 7f 83 82 ff 7b 7f 7e ff ... 48 52 2f ff 4f 59 36 ff 50 5a 37 ff 4e 58 35 ff
7e 82 81 ff fe ff fe 00 fe ff ff 00 fe fe fe 00 ... 06 06 06 00 0d 0d 0d 00 0d 0d 0d 00 08 08 08 00
One of the culprits could be the PCMPGTD. We use it like this:
c1 = PCMPGTD pb pa
If this actually means 'greater than', then the code is:
c1 = pb > pa
And the condition was:
if pa <= pb and pa <= pc
return a
elif pb <= pc
return b
else
return c
So there we have our problem. I had to think about this a bit and transform it into the correct form.
if not (pa > pb or pa > pc)
return a
elif not pb > pc
return b
else
return c
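To double-check that the transformed condition is equivalent, and to see how the selection reads when only "greater than" comparisons are available, here is a brute-force comparison in plain Python; the function names are mine and this is not the exact register dance of the machine code:
def paeth(a, b, c):
    # the straightforward predictor from earlier in the post
    p = a + b - c
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    if pa <= pb and pa <= pc:
        return a
    elif pb <= pc:
        return b
    return c

def paeth_gt_only(a, b, c):
    p = a + b - c
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    c1 = pa > pb                          # the only comparisons PCMPGT gives us
    c2 = pa > pc
    c3 = pb > pc
    take_a = not (c1 or c2)
    take_b = not take_a and not c3
    take_c = not take_a and c3
    return (a if take_a else 0) | (b if take_b else 0) | (c if take_c else 0)

for a in range(0, 256, 17):
    for b in range(0, 256, 17):
        for c in range(0, 256, 17):
            assert paeth(a, b, c) == paeth_gt_only(a, b, c)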
Even then the code is not correct yet. Let's look at the input next. It looks weird because the register dump prints it back to front. Let's label the registers and compare them with what we know.
input: /x $xmm0.uint128 = 0x 00 fe fe fe 00 ff ff ff 00 ff ff ff 00 ff ff ff
a: /x $xmm7.uint128 = 0x ff 7d 7e 7a ff 81 82 7e ff 7e 7f 7b ff 81 82 7e
b: /x $xmm8.uint128 = 0x ff 7d 7e 7a ff 7e 7f 7b ff 82 83 7f ff 7f 80 7c
c: /x $xmm9.uint128 = 0x ff 7e 7f 7b ff 82 83 7f ff 7f 80 7c ff 82 83 7f
prior: 7f 83 82 ff 7c 80 7f ff 7f 83 82 ff 7b 7f 7e ff ... 48 52 2f ff 4f 59 36 ff 50 5a 37 ff 4e 58 35 ff
scanline: 7e 82 81 ff 7b 7f 7e ff 7e 82 81 ff 7a 7e 7d ff ... 50 5a 37 ff 57 61 3e ff 57 61 3e ff 52 5c 39 ff
This is precise. Next the data gets unpacked into the six registers. We see them being zeroed first:
al: /x $xmm1.uint128 = 0x00000000000000000000000000000000
bl: /x $xmm2.uint128 = 0x00000000000000000000000000000000
cl: /x $xmm3.uint128 = 0x00000000000000000000000000000000
ah: /x $xmm4.uint128 = 0x00000000000000000000000000000000
bh: /x $xmm5.uint128 = 0x00000000000000000000000000000000
ch: /x $xmm6.uint128 = 0x00000000000000000000000000000000
Then they get filled up with data:
al: /x $xmm1.uint128 = 0xff00 7e00 7f00 7b00 ff00 8100 8200 7e00
bl: /x $xmm2.uint128 = 0xff00 8200 8300 7f00 ff00 7f00 8000 7c00
cl: /x $xmm3.uint128 = 0xff00 7f00 8000 7c00 ff00 8200 8300 7f00
ah: /x $xmm4.uint128 = 0xff00 7d00 7e00 7a00 ff00 8100 8200 7e00
bh: /x $xmm5.uint128 = 0xff00 7d00 7e00 7a00 ff00 7e00 7f00 7b00
ch: /x $xmm6.uint128 = 0xff00 7e00 7f00 7b00 ff00 8200 8300 7f00
So the problem appears to be that we do the wrong kind of unpacking operation here. We can fix it with PSRLW, which shifts each 16-bit word right; a shift by 8 moves the byte down into the low half of its word.
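A quick sanity check of that fix in plain Python: shifting each 16-bit lane right by 8, which is what PSRLW with an immediate of 8 does, drops the byte from the high half of the word into the low half:
words = [0xff00, 0x7e00, 0x7f00, 0x7b00]     # lanes as the dump above shows them
print([hex(w >> 8) for w in words])          # ['0xff', '0x7e', '0x7f', '0x7b']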
The beginning of the sequence started to look right at this point, but the rest was still wrong. I started to realise there was something more fundamentally wrong here.
Ok. So we fill our first run like this:
7e 82 81 ff .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
Then we jump ahead by 16 bytes and we start reading the input from here:
7e 82 81 ff .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^
You know what? At this point I realised I cannot do it this way. Each reconstructed byte depends on the reconstructed byte one stride to its left, so with this parallelization technique I can only compute the contents of one pixel at a time. And it's messy considering what it does.
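To spell out the dependency that makes this a dead end, here is a sketch of the unfiltering in plain Python, mirroring how the scalar filter walks a scanline; the names are mine and it only illustrates the data flow:
def paeth(a, b, c):
    p = a + b - c
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    return a if pa <= pb and pa <= pc else (b if pb <= pc else c)

def unfilter_scanline(prior, data, stride):
    out = []
    for x, byte in enumerate(data):
        l = out[x - stride] if x >= stride else 0      # reconstructed in this very loop
        u = prior[x]
        c = prior[x - stride] if x >= stride else 0
        out.append((byte + paeth(l, u, c)) & 0xff)
    return out
Because l reads bytes that this same loop produced only stride positions earlier, the chain is serial along the scanline, and the stride bytes of a single pixel are the widest unit that can be computed independently.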
I found yet more things I had done wrong, such as using doubleword operations where I should have used word operations. After fixing those I started getting results that seemed plausible, but compared to the correct output they were still incorrect.
Let's leave it on the backburner
This was the first time I have studied the SIMD instructions in detail. The outcome was a complete mess, but I learned so much in the process that I'm not even frustrated.
The techniques presented here clearly work, but I would need to write a compiler to make the machine code approach a more practical choice.