That ANSI definition has annoyed me for decades, so would anyone object if I just have the functions return a pointer to the end of the string/memory, which is something that is actually useful information?
I've never used/assigned the value returned from those functions, so it doesn't matter to me personally either way.
I agree, I've never found the "standard" return values to be useful.
But I can't count the number of times that I've had to do ...
strcpy(ptr, string);
ptr += strlen(ptr);

It would be much nicer (and faster) to say ...

ptr = strcpy(ptr, string);

I think that I'll take advantage of the fact that the "classic" HuC didn't set the return values at all in order to make the change.
I've actually done that already, and checked the new str/mem functions into github.
The new functions are approx 60% of the size of the old functions, but run 2 or 3 times faster (depending upon which function).
That's 230 bytes for the package vs 398 bytes in the old HuC.
It leads me into a bit of a "rant" about the dangers of using macros in assembly language.
**************************************
Macros are great ... they're useful for including common little sequences of code as a single instruction, which can make code easier to write, and easier to read.
But it's easy to get lazy and not really think about what is going on inside them, and end up writing sloppy code if you're not careful.
This isn't so bad in a function that gets called once in a game ... but it's not good practice in library functions that are supposed to be small and fast, especially if you're thinking that new programmers might look at them as examples of how to program.
For instance, here is the old HuC/MagicKit library function for memcpy() ...
_memcpy.3:  __stw   <__ax
.cpylp:     lda     [__si]
            sta     [__di]
            incw    <__si
            incw    <__di
            decw    <__ax
            tstw    <__ax
            bne     .cpylp
            rts

It looks nice and simple, it's easy to read, and it's so short that it must be fast, right?
Well ... no!
There are a whole bunch of macros in there, which expand the code out into ...
_memcpy.3:  stx     <__ax
            sta     <__ax+1
.cpylp:     lda     [__si]
            sta     [__di]
            inc     <__si
            bne     .l1
            inc     <__si+1
.l1:        inc     <__di
            bne     .l2
            inc     <__di+1
.l2:        sec
            lda     <__ax
            sbc     #1
            sta     <__ax
            lda     <__ax+1
            sbc     #0
            sta     <__ax+1
            lda     <__ax
            ora     <__ax+1
            bne     .cpylp
.done:      rts

That's a *huge* and *slow* inner-loop, taking 68 cycles per byte that's copied.
If you get rid of all of those macros and just write it carefully in optimized assembly language, you get ...
_memcpy.3:  stx     <__temp
            cly
            tax
            beq     .done_pages
.copy_page: lda     [__si],y
            sta     [__di],y
            iny
            bne     .copy_page
            inc     <__si+1
            inc     <__di+1
            dex
            bne     .copy_page
.done_pages:ldx     <__temp
            beq     .done_bytes
.copy_byte: lda     [__si],y
            sta     [__di],y
            iny
            dex
            bne     .copy_byte
.done_bytes:rts

The function is both smaller and a lot faster, taking 22 cycles per byte that's copied.
That's a 3x improvement in speed, and just about as good as you can get on the classic 6502 architecture.
You can do a little loop-unrolling to make it a tiny bit faster ... but it's not a huge improvement.
This version trades that little bit of speed in favor of staying smaller since it's a rarely-used function in a PCE game.
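For anyone following the assembly loosely, the page-then-remainder strategy above can be modelled in C like this (hypothetical function name, not part of the HuC library):

```c
#include <stdint.h>

/* C model of the page-wise copy strategy: the 8-bit index 'y' wraps
 * naturally at 256, matching the iny/bne inner loop, and the source
 * and destination high bytes are bumped once per 256-byte page. */
static void memcpy_pages(uint8_t *dst, const uint8_t *src, uint16_t len)
{
    uint8_t pages = (uint8_t)(len >> 8);   /* high byte: whole pages */
    uint8_t rest  = (uint8_t)(len & 0xFF); /* low byte: leftovers    */
    uint8_t y = 0;

    while (pages != 0) {                   /* .copy_page loop        */
        do {
            dst[y] = src[y];
        } while (++y != 0);                /* iny / bne wraps at 256 */
        src += 256;                        /* inc <__si+1            */
        dst += 256;                        /* inc <__di+1            */
        pages--;                           /* dex                    */
    }
    while (rest != 0) {                    /* .copy_byte loop        */
        dst[y] = src[y];
        y++;
        rest--;
    }
}
```

The win comes from moving the 16-bit pointer and counter arithmetic out of the per-byte inner loop and into the once-per-page outer loop.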
As bonknuts and touko will point out, the way to do it more efficiently on the PCE is to use a TII instruction, which runs at 6 cycles per byte.
I'm just not convinced (yet) that these functions are used often-enough that it's worth the increase in code-size for making a general-purpose TII version of the routine.