That ANSI definition has annoyed me for decades, so would anyone object if I just have the functions return a pointer to the end of the string/memory, which is something that is actually useful information?
I've never used/assigned the value returned from those functions, so it doesn't matter to me personally either way.
I agree, I've never found the "standard" return values to be useful.
But I can't count the number of times that I've had to do ...
strcpy(ptr, string);
ptr += strlen(ptr);

It would be much nicer (and faster) to say ...

ptr = strcpy(ptr, string);

I think that I'll take advantage of the fact that the "classic" HuC didn't set the return values at all in order to make the change.
I've actually done that already, and checked the new str/mem functions into github.
The new functions are approx 60% of the size of the old functions, but run 2 or 3 times faster (depending upon which function).
That's 230 bytes for the package vs 398 bytes in the old HuC.
It leads me into a bit of a "rant" about the dangers of using macros in assembly language.
**************************************
Macros are great ... they're useful for including common little sequences of code as a single instruction, which can make code easier to write, and easier to read.
But it's easy to get lazy and not really think about what is going on inside them, and end up writing sloppy code if you're not careful.
This isn't so bad in a function that gets called once in a game ... but it's not good practice in library functions that are supposed to be small and fast, especially if you're thinking that new programmers might look at them as examples of how to program.
For instance, here is the old HuC/MagicKit library function for memcpy() ...
_memcpy.3:  __stw   <__ax
.cpylp:     lda     [__si]
            sta     [__di]
            incw    <__si
            incw    <__di
            decw    <__ax
            tstw    <__ax
            bne     .cpylp
            rts

It looks nice and simple, it's easy to read, and it's so short that it must be fast, right?
Well ... no!
There are a whole bunch of macros in there, which expand the code out into ...
_memcpy.3:  stx     <__ax
            sta     <__ax+1
.cpylp:     lda     [__si]
            sta     [__di]
            inc     <__si
            bne     .l1
            inc     <__si+1
.l1:        inc     <__di
            bne     .l2
            inc     <__di+1
.l2:        sec
            lda     <__ax
            sbc     #1
            sta     <__ax
            lda     <__ax+1
            sbc     #0
            sta     <__ax+1
            lda     <__ax
            ora     <__ax+1
            bne     .cpylp
.done:      rts

That's a *huge* and *slow* inner-loop, taking 68 cycles per byte that's copied.
If you get rid of all of those macros and just write it carefully in optimized assembly language, you get ...
_memcpy.3:  stx     <__temp
            cly
            tax
            beq     .done_pages
.copy_page: lda     [__si],y
            sta     [__di],y
            iny
            bne     .copy_page
            inc     <__si+1
            inc     <__di+1
            dex
            bne     .copy_page
.done_pages:ldx     <__temp
            beq     .done_bytes
.copy_byte: lda     [__si],y
            sta     [__di],y
            iny
            dex
            bne     .copy_byte
.done_bytes:rts

The function is both smaller and a lot faster, taking 22 cycles per byte that's copied.
That's a 3x improvement in speed, and just about as good as you can get on the classic 6502 architecture.
You can do a little loop-unrolling to make it a tiny bit faster ... but it's not a huge improvement.
This version trades that little bit of speed in favor of staying smaller since it's a rarely-used function in a PCE game.
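For anyone following the assembly loosely, the page-then-remainder strategy above can be modelled in C like this (hypothetical function name, not part of the HuC library):

```c
#include <stdint.h>

/* C model of the page-wise copy strategy: the 8-bit index 'y' wraps
 * naturally at 256, matching the iny/bne inner loop, and the source
 * and destination high bytes are bumped once per 256-byte page. */
static void memcpy_pages(uint8_t *dst, const uint8_t *src, uint16_t len)
{
    uint8_t pages = (uint8_t)(len >> 8);   /* high byte: whole pages */
    uint8_t rest  = (uint8_t)(len & 0xFF); /* low byte: leftovers    */
    uint8_t y = 0;

    while (pages != 0) {                   /* .copy_page loop        */
        do {
            dst[y] = src[y];
        } while (++y != 0);                /* iny / bne wraps at 256 */
        src += 256;                        /* inc <__si+1            */
        dst += 256;                        /* inc <__di+1            */
        pages--;                           /* dex                    */
    }
    while (rest != 0) {                    /* .copy_byte loop        */
        dst[y] = src[y];
        y++;
        rest--;
    }
}
```

The win comes from moving the 16-bit pointer and counter arithmetic out of the per-byte inner loop and into the once-per-page outer loop.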
As bonknuts and touko will point out, the way to do it more efficiently on the PCE is to use a TII instruction, which runs at 6 cycles per byte.
I'm just not convinced (yet) that these functions are used often-enough that it's worth the increase in code-size for making a general-purpose TII version of the routine.