ARM32 really doesn't have enough regs to be cavalier about it for this kind of code: 16 GPRs, minus PC and probably SP too. The rest we can definitely spill, but still, that leaves 14, and you probably lose another 2 for loop counters (unless you fully unroll a block) and temps. To a first-order approximation, each interpolated attribute that we want to keep fully in regs needs 3 registers (current value, x-step, y-step). If you naively take the 3 edge equations plus u/w, v/w, 1/w, that's 6 interpolated quantities where only about 4 fit, so already 2 too many for the partial blocks (the full blocks don't need edge eqs).



With a bit of spilling (specifically, spilling the y-steps), you can get down to 2 registers per attribute at some cost. You can also exploit the fact that the three half-edge functions sum to a constant, c1+c2+c3 = C, so you can rewrite "c3 >= 0" as "C - (c1+c2) >= 0" <=> "C >= c1 + c2". That's some extra work per pixel, but it means c3 only needs one reg "permanently" (holding C) and one temp.
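
To make the c3 trick concrete, here's a minimal sketch in C. The struct and function names are made up for illustration, and it assumes integer edge functions that are >= 0 inside the triangle and small enough that c1 + c2 doesn't overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-block state: only c1 and c2 are stepped per pixel,
   c3 is implied by the constant sum C = c1 + c2 + c3. */
typedef struct {
    int32_t c1, c2;  /* current values of two edge functions */
    int32_t C;       /* c1 + c2 + c3, the same at every pixel */
} EdgeState;

/* Build the state from all three edge functions evaluated at one pixel. */
static EdgeState edge_state_init(int32_t c1, int32_t c2, int32_t c3)
{
    EdgeState e = { c1, c2, c1 + c2 + c3 };
    /* Assumption: non-degenerate triangle with "inside" being positive. */
    assert(e.C > 0);
    return e;
}

/* Inside test: c1 >= 0, c2 >= 0, and "c3 >= 0" rewritten as C >= c1 + c2. */
static int inside(const EdgeState *e)
{
    int32_t sum = e->c1 + e->c2;  /* the one temp */
    return e->c1 >= 0 && e->c2 >= 0 && e->C >= sum;
}
```

Per pixel you'd step only c1 and c2 and call something like inside(); the price is the extra add for the temp.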



You can break out the old SW rendering bag of tricks and start packing multiple values into registers, effectively doing SIMD with regular integer ops. That's a fine technique to polish a good loop, but I wouldn't recommend going down that route until you're sure you've got the basic algorithm down, as it makes further changes painful.
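
For illustration, here's one way the packing can look, as a sketch under some assumptions: u and v are 8.8 fixed point and share one 32-bit register (u in the high half, v in the low half), and v never leaves the range [0, 65536) during the span, so no carry or borrow leaks into u. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two unsigned 16-bit fixed-point values into one 32-bit word. */
static uint32_t pack_uv(uint16_t u, uint16_t v)
{
    return ((uint32_t)u << 16) | v;
}

/* Pack signed per-pixel steps. A negative dv is sign-extended, so it
   borrows from the u field here; that borrow is cancelled by the carry
   out of the v field when the step is added. */
static uint32_t pack_step(int16_t du, int16_t dv)
{
    return ((uint32_t)(int32_t)du << 16) + (uint32_t)(int32_t)dv;
}

static uint16_t unpack_u(uint32_t p) { return (uint16_t)(p >> 16); }
static uint16_t unpack_v(uint32_t p) { return (uint16_t)(p & 0xFFFFu); }

/* Walk n pixels, writing the integer texel coords of each one. */
static void step_span(uint32_t uv, uint32_t duv, int n,
                      uint8_t *tu, uint8_t *tv)
{
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        tu[i] = (uint8_t)(uv >> 24);          /* integer part of u */
        tv[i] = (uint8_t)((uv >> 8) & 0xFFu); /* integer part of v */
        uv += duv;                            /* one add steps both */
    }
}
```

One register and one add now do the work of two, which is also why changing formats or precision later gets painful.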



You can use a 2DH (2D homogeneous) barycentric rasterizer and interpolate the barycentric u/w, v/w, 1/w, which correspond to c1, c2 and c1+c2+c3. This makes stepping cheaper and frees up lots of regs, but recovering the interpolated texture coords per pixel now requires 2 mul-adds. Worth considering if you have fast mul-add, not so great otherwise (i.e. not something you'd want on ARM, generally).
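
A rough sketch of the recovery step, in float for readability. The vertex numbering is my assumption (c1 weights vertex 1, c2 weights vertex 2, the remaining weight goes to the base vertex 0), and the struct and function names are invented for the example.

```c
#include <assert.h>

typedef struct { float u, v; } UV;

/* Hypothetical per-triangle setup: texture coords at the base vertex
   plus the deltas from the base vertex to the other two vertices. */
typedef struct {
    float u0, v0;    /* tex coords at vertex 0 */
    float du1, dv1;  /* vertex 1 minus vertex 0 */
    float du2, dv2;  /* vertex 2 minus vertex 0 */
} TexSetup;

/* c1, c2 are the stepped edge functions, csum is c1 + c2 + c3,
   which plays the role of 1/w. */
static UV recover_uv(const TexSetup *t, float c1, float c2, float csum)
{
    assert(csum != 0.0f);
    float w  = 1.0f / csum;  /* the divide */
    float b1 = c1 * w;       /* perspective-correct barycentrics */
    float b2 = c2 * w;
    UV r;
    r.u = t->u0 + b1 * t->du1 + b2 * t->du2;  /* 2 mul-adds */
    r.v = t->v0 + b1 * t->dv1 + b2 * t->dv2;  /* 2 mul-adds */
    return r;
}
```

The stepping state is just c1, c2 and the sum, no matter how many attributes you recover from them afterwards.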



What I described in my prev post is just evaluating 1/w *once*, at the middle of the block, and pretending it's constant. One divide and a couple of muls get you all you need to do interpolation setup from that. This is nice and simple, but the UVs at the block corners generally won't match up between neighboring blocks. How bad that is depends on how steep the tri is. Worth trying for 4x4, I'd suspect it looks pretty crappy for 8x8.
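
Roughly, the per-block setup for that could look like this. It's a sketch of how I read "pretend 1/w is constant": scale the u/w and v/w values and their gradients by the w at the block center. The Plane and AffineBlock types are hypothetical, and it assumes 1/w > 0 at the center.

```c
#include <assert.h>

/* A screen-space linear quantity, described relative to the block center. */
typedef struct { float at_center, dx, dy; } Plane;

/* Affine (plain linear) interpolation setup for one block. */
typedef struct {
    float u, dudx, dudy;
    float v, dvdx, dvdy;
} AffineBlock;

static AffineBlock block_setup(Plane uow, Plane vow, float oow_center)
{
    assert(oow_center > 0.0f);
    float w = 1.0f / oow_center;  /* the one divide per block */
    AffineBlock b;
    b.u    = uow.at_center * w;   /* exact at the center... */
    b.dudx = uow.dx * w;          /* ...approximate everywhere else */
    b.dudy = uow.dy * w;
    b.v    = vow.at_center * w;
    b.dvdx = vow.dx * w;
    b.dvdy = vow.dy * w;
    return b;
}

/* u at pixel offset (ox, oy) from the block center. */
static float block_u(const AffineBlock *b, float ox, float oy)
{
    return b->u + ox * b->dudx + oy * b->dudy;
}
```

Two neighboring blocks each use their own center w, which is exactly why their values along the shared edge don't agree.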



If you take 1/w at the corners, you still shouldn't compute the 4 corners for every block - adjacent blocks share corners (after all, that's the whole reason for doing this in the first place - you want the corners to match). If your main block sweep is left-right, it's easy to get down to only computing 2 corner values (thus 2 divs) per block, and recycling 2 values from the previous block (only computing them from scratch if you step down a block row). You can get down to only computing 1 value in most blocks if you also cache the corner values for the previous block row so you don't need to recompute them next time. This gets messy though since there's no guarantee there was a block right above the current one in the previous block row, so you need to keep track of which of the cached corner values are actually valid (it's always a single interval, but still... ugh).
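
Here's a sketch of the simple version (left-right sweep, 2 new corners per block, no caching across block rows). The types, the divide counter and the output layout are all invented for the example; a real loop would shade each block right away instead of storing the corners.

```c
#include <assert.h>

typedef struct { float u, v; } UV;

/* Screen-space linear function: value = a*x + b*y + c. */
typedef struct { float a, b, c; } Plane;

/* Hypothetical per-triangle setup: planes for u/w, v/w and 1/w. */
typedef struct { Plane uow, vow, oow; } Persp;

enum { TL, TR, BL, BR };

static int divide_count;  /* demo only: counts the divides we perform */

static float plane_eval(Plane p, float x, float y)
{
    return p.a * x + p.b * y + p.c;
}

/* Perspective-correct UV at one block corner: one divide. */
static UV corner_uv(const Persp *p, float x, float y)
{
    float w = 1.0f / plane_eval(p->oow, x, y);
    ++divide_count;
    UV r = { plane_eval(p->uow, x, y) * w, plane_eval(p->vow, x, y) * w };
    return r;
}

/* Sweep one block row left to right, writing the 4 corner UVs per block. */
static void sweep_block_row(const Persp *p, int x0, int y0,
                            int block, int nblocks, UV (*out)[4])
{
    assert(block > 0 && nblocks >= 0);
    if (nblocks == 0)
        return;
    /* Start of a block row: left corners computed from scratch. */
    UV tl = corner_uv(p, (float)x0, (float)y0);
    UV bl = corner_uv(p, (float)x0, (float)(y0 + block));
    for (int i = 0; i < nblocks; ++i) {
        float xr = (float)(x0 + (i + 1) * block);
        UV tr = corner_uv(p, xr, (float)y0);            /* new corner */
        UV br = corner_uv(p, xr, (float)(y0 + block));  /* new corner */
        out[i][TL] = tl; out[i][TR] = tr;
        out[i][BL] = bl; out[i][BR] = br;
        tl = tr;  /* right corners become the next block's left corners */
        bl = br;
    }
}
```

That's 2 divides per block plus 2 to start each row. Getting to 1 per block means also keeping the previous row's bottom corners around, with the validity tracking mentioned above.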



Finally, some pedantry: Affine interpolation is linear interpolation. Not bilinear, certainly not trilinear. If you compute 4 arbitrary values at the block corners and interpolate between them, you'll generally end up with bilinear, but still not trilinear. :)