touko: Continuing the discussion...
I was looking into doing fast multiplication on the PCE. Stef mentioned the 68k can do a 16bit * 16bit -> 32bit multiply in 70 cycles.
I looked up some routines and came across the old C64 code for fast mul. I've seen it before, a number of years back, but I never had a need for it. Almost all the multiplication in my code had one operand as a constant and was optimized as such. But for something else that I started, I needed variable values for both A and B.
The fast mul routine is based on a*b = f(a+b) - f(a-b), where f(x) = x^2/4. If you break the multiplication down into 8bit steps, a+b is a 9bit result, so f(a+b) needs a 512-entry, WORD-wide LUT. This breaks the multiply down into table lookups plus simple additions and subtractions (albeit 16bit add/sub operations). I have a few ideas how to speed this up further, but I have to write the code out and compare cycles.
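To make the idea concrete, here is a rough C model of the quarter-square trick. This is only a sketch to check the math, not the C64 routine or anything I've written for the PCE yet. The names (qsq, init_qsq, mul8, mul16) are made up for illustration, and I'm assuming the table stores floor(x^2/4), which still gives exact results because a+b and a-b always have the same parity. I'm also assuming the 16bit case is built from four 8bit partial products.

```c
#include <stdint.h>

/* Hypothetical quarter-square table: qsq[x] = floor(x*x / 4).
   512 entries, one 16bit WORD each. The largest value, at x = 511,
   is 65280, so every entry fits in 16 bits. */
static uint16_t qsq[512];

/* Fill the table once at startup (on real hardware this would be ROM data). */
void init_qsq(void)
{
    for (uint32_t x = 0; x < 512; x++) {
        qsq[x] = (uint16_t)((x * x) / 4);
    }
}

/* 8bit * 8bit -> 16bit using a*b = f(a+b) - f(a-b).
   f is even, so |a-b| is used to keep the index non-negative. */
uint16_t mul8(uint8_t a, uint8_t b)
{
    unsigned sum  = (unsigned)a + b;                         /* 9bit index */
    unsigned diff = (a > b) ? (unsigned)(a - b) : (unsigned)(b - a);
    return (uint16_t)(qsq[sum] - qsq[diff]);
}

/* 16bit * 16bit -> 32bit assembled from four 8bit partial products. */
uint32_t mul16(uint16_t a, uint16_t b)
{
    uint8_t al = (uint8_t)(a & 0xFF), ah = (uint8_t)(a >> 8);
    uint8_t bl = (uint8_t)(b & 0xFF), bh = (uint8_t)(b >> 8);

    return  (uint32_t)mul8(al, bl)
         + ((uint32_t)mul8(al, bh) << 8)
         + ((uint32_t)mul8(ah, bl) << 8)
         + ((uint32_t)mul8(ah, bh) << 16);
}
```

Call init_qsq() once, then mul16(a, b) does the whole multiply with 8 table lookups and a handful of adds/subs. The cycle counts on the actual CPU will depend entirely on how the lookups and 16bit add/sub chains are laid out in assembly, so this says nothing about speed.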