This is from a discussion between Steve Snake, Chilly Willy, Exophase, and me. We were pretty much putting popular claims to the test.
This was a segment of example code Steve Snake wrote (note: before he made the famous Kega/Fusion emulator, he was a programmer for the MD and other platforms). It's a velocity update routine for an object (both X and Y directions):
68k:
4 lea address.w,a0 ;8/12
2 bsr ;18
2 move.l (a0)+,d0 ;12
2 add.l d0,(a0) ;20
2 move.l (a1)+,d0 ;12
2 add.l d0,(a1) ;20
2 rts ;18. 64+36=100+8=108(112)
16/7
;6280 object
2 ldx #$xx ;2
3 jsr AddVelocity ;7
AddVelocity:
3 lda x_float,x ;5
1 clc ;2
3 adc <x_float_inc,x ;4
3 sta x_float,x ;5
3 lda x_whole.l,x ;5
3 adc <x_whole_inc,x ;4
3 sta x_whole.l,x ;5
3 lda x_whole.h,x ;5
2 adc #$00 ;2
3 sta x_whole.h,x ;5 = 42
3 lda y_float,x ;5
3 adc <y_float_inc,x ;4
3 sta y_float,x ;5
3 lda y_whole.l,x ;5
3 adc <y_whole_inc,x ;4
3 sta y_whole.l,x ;5
3 lda y_whole.h,x ;5
2 adc #$00 ;2
3 sta y_whole.h,x ;5 = 40
1 rts ;7
62/22
; 82+14 = 96+2=98 (102)
; 16.8 position + 8.8 speed -> 16.8 position
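The cycle totals in the two listings can be re-added as a quick sanity check. This is a minimal sketch in Python, with the per-instruction counts copied straight from the comments in the listings. The list names and the total() helper are made up for illustration.

```python
# Cycle counts copied from the comments in the two listings.
# The 68k lea is listed as 8/12; the 8-cycle case is used here.
M68K_LISTING = [
    ("lea address.w,a0", 8),
    ("bsr", 18),
    ("move.l (a0)+,d0", 12),
    ("add.l d0,(a0)", 20),
    ("move.l (a1)+,d0", 12),
    ("add.l d0,(a1)", 20),
    ("rts", 18),
]

HUC6280_LISTING = [
    ("ldx #$xx", 2),
    ("jsr AddVelocity", 7),
    # X half of the routine (42 cycles)
    ("lda x_float,x", 5), ("clc", 2), ("adc <x_float_inc,x", 4),
    ("sta x_float,x", 5), ("lda x_whole.l,x", 5), ("adc <x_whole_inc,x", 4),
    ("sta x_whole.l,x", 5), ("lda x_whole.h,x", 5), ("adc #$00", 2),
    ("sta x_whole.h,x", 5),
    # Y half of the routine (40 cycles, no clc)
    ("lda y_float,x", 5), ("adc <y_float_inc,x", 4), ("sta y_float,x", 5),
    ("lda y_whole.l,x", 5), ("adc <y_whole_inc,x", 4), ("sta y_whole.l,x", 5),
    ("lda y_whole.h,x", 5), ("adc #$00", 2), ("sta y_whole.h,x", 5),
    ("rts", 7),
]


def total(listing):
    """Sum the cycle column of a listing."""
    return sum(cycles for _, cycles in listing)


print("68k :", total(M68K_LISTING))
print("6280:", total(HUC6280_LISTING))
```

Swapping the lea for its 12-cycle case adds 4 to the 68k total, which is where the (112) in the listing comes from.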
These examples were trying to stay in a game logic context, but the prep part is actually unrealistic. I wouldn't be loading an immediate into X; it would come from an object table (maybe adding 10 cycles or so). The 68k equivalent would cost more than 10 cycles for the same thing. But I did it that way because his (Steve Snake's) fixed address loaded into A0 was a bit unrealistic as well (using LEA abs,a0 is basically a faster way to load an immediate into an address register than using move).
The 68k one is 108 cycles and the 6280 one is 98 cycles. While this isn't a straight apples-to-apples comparison, relative to what needs to be accomplished I think they are directly comparable. The differences between the two are these: the 68k version uses signed numbers (so you don't need four sets of routines), while the PCE version uses unsigned numbers and needs a jump table depending on which of the four directions the object is moving. The 68k one is using 32bit math; 16bit:16bit fixed point. So 16:16 + 16:16 -> 32bit. I consider this completely overkill. One, a whole-number speed larger than 8 bits means you might not even see the object move on screen (it could skip the screen entirely if aligned right); it's not needed. Two, 1/65536 of a pixel of movement is overkill to me. Hell, even 1/256 is a little bit overkill. But... it's done for reasons of convenience and speed.
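To show why the signed 16:16 approach gets away with one routine, here is a rough Python model of what a single 32-bit add.l does to a 16:16 position. The helper names (to_fixed_16_16, add_l, and so on) are invented for this sketch and are not from Steve Snake's code.

```python
MASK32 = 0xFFFFFFFF


def to_fixed_16_16(value):
    """Encode a number as 16:16 fixed point in a 32-bit two's complement word."""
    return int(round(value * 65536)) & MASK32


def add_l(dst, src):
    """Model a 32-bit add that wraps around like a 68k add.l."""
    return (dst + src) & MASK32


def whole_part(word):
    """Upper 16 bits: the whole-pixel part."""
    return word >> 16


def frac_part(word):
    """Lower 16 bits: the fraction, in 1/65536 units."""
    return word & 0xFFFF


position = to_fixed_16_16(100.0)

# A negative velocity is just a large two's complement word, so the same
# add moves the object left/up as well as right/down.
moved_left = add_l(position, to_fixed_16_16(-1.5))
moved_right = add_l(position, to_fixed_16_16(0.25))

print(hex(moved_left), whole_part(moved_left), frac_part(moved_left) / 65536)
print(hex(moved_right), whole_part(moved_right), frac_part(moved_right) / 65536)
```

The sign of the velocity does the direction selection for free, which is the convenience being paid for with the wider math.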
So the 6280 one has a 16:8 (24bit) fixed point position for both X and Y. The scalar/speed is 8:8 (16bit). 24bit + 16bit -> 24bit. If I did a straight 32bit conversion of his code, then it'd be slower on the 6280. So it's adapted to what is needed, since the original is overkill. You could technically do an 8:8 fixed point position for X and Y (say for a clipped horizontal shootie or a vertical shootie) and speed it up, but I wanted a more realistic conversion of his code.
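For anyone who wants to follow the carry chain, here is a small Python model of one axis of the 16:8 + 8:8 add, one byte at a time, the way the lda/adc/sta triples in the 6280 listing do it. The function and parameter names are made up for the sketch, and it only models the unsigned add case.

```python
def adc(a, b, carry):
    """8-bit add with carry in; returns (result byte, carry out)."""
    total = a + b + carry
    return total & 0xFF, total >> 8


def add_velocity_axis(frac, whole_lo, whole_hi, inc_frac, inc_whole):
    """Add an 8:8 speed to a 16:8 position, byte by byte.

    Mirrors the X half of the listing: clc, then three lda/adc/sta steps
    with the carry rippling from the fraction up to the high byte.
    """
    frac, carry = adc(frac, inc_frac, 0)               # clc + fraction byte
    whole_lo, carry = adc(whole_lo, inc_whole, carry)  # low byte of whole part
    whole_hi, carry = adc(whole_hi, 0x00, carry)       # adc #$00: carry only
    return frac, whole_lo, whole_hi


# Position $01FF.80 (511.5) plus speed $01.C0 (1.75).
frac, lo, hi = add_velocity_axis(0x80, 0xFF, 0x01, 0xC0, 0x01)
position = ((hi << 8) | lo) + frac / 256
print(hex(hi), hex(lo), hex(frac), position)
```

Because the speed is unsigned here, moving the other way needs a subtract variant of the same chain, which is why the PCE version ends up with a routine per direction.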
For reference, the '816 version was 80 cycles (that's a plain '816, not the SNES CPU, which has wait states on RAM, so there it'd be closer to ~90 cycles) and used full 32bit variables like the 68k version (it was faster to do it that way on the '816 because of the lack of byte access opcodes).
Edit: There's nothing fancy or clever about my code, either. Sure, it uses split tables - but that's a given for any 65x array access that's larger than one byte width. No voodoo code there.
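If "split tables" is unfamiliar: each byte of a multi-byte field gets its own array, and the object number indexes all of them directly. A rough Python illustration follows, assuming an 8-bit index register as on the 6502/6280. The table size and helper names are hypothetical; the array names just echo the labels in the listing.

```python
MAX_OBJECTS = 256  # as many entries as an 8-bit index can reach

# One table per byte of the 16-bit field, instead of interleaved lo/hi pairs.
x_whole_lo = bytearray(MAX_OBJECTS)
x_whole_hi = bytearray(MAX_OBJECTS)


def write_x_whole(obj, value):
    """Store a 16-bit value; the same index is used for both tables."""
    x_whole_lo[obj] = value & 0xFF
    x_whole_hi[obj] = (value >> 8) & 0xFF


def read_x_whole(obj):
    """Rebuild the 16-bit value from the two tables."""
    return x_whole_lo[obj] | (x_whole_hi[obj] << 8)


def interleaved_offset(obj, field_size=2):
    """Byte offset the same entry would need in an interleaved table."""
    return obj * field_size


write_x_whole(5, 0x1234)
print(hex(read_x_whole(5)))

# With interleaved 16-bit entries, object 128 already needs offset 256,
# which no longer fits in an 8-bit index.
print(interleaved_offset(128))
```

With the split layout the index register holds the object number as-is, with no scaling, and every field table can hold a full 256 entries.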