Author Topic: HuC questions.  (Read 2516 times)

touko

  • Hero Member
  • *****
  • Posts: 953
Re: HuC questions.
« Reply #30 on: May 30, 2016, 12:27:56 AM »
Do you know why HuC includes a dummy .dw between each piece of included data?

elmer

  • Hero Member
  • *****
  • Posts: 2153
Re: HuC questions.
« Reply #31 on: May 30, 2016, 04:06:55 AM »
Do you know why HuC includes a dummy .dw between each piece of included data?

If you're asking me, then the answer is ... no, I have no idea.

I'm concentrating on CC65, because it looks (so far) as though I can fairly easily make a 65C02 version of CC65 that will generate code that significantly outperforms HuC.

elmer

  • Hero Member
  • *****
  • Posts: 2153
Re: HuC questions.
« Reply #32 on: May 30, 2016, 06:48:33 AM »
OK, next question ... what do people use zero-page for, and how full does it get?

What do you think about putting the C stack in zero-page (there are lots of advantages to the code)?

The "downside" (which I think is minimal on the PCE) is that you wouldn't be allowed to have large stack-based structs/arrays/etc ... they'd have to be turned into pointers (it might be possible to do that automatically).

touko

  • Hero Member
  • *****
  • Posts: 953
Re: HuC questions.
« Reply #33 on: May 30, 2016, 07:13:00 AM »
Quote
If you're asking me, then the answer is ... no, I have no idea.
Not you specifically, but if somebody knows the answer   :D

TheOldMan

  • Hero Member
  • *****
  • Posts: 958
Re: HuC questions.
« Reply #34 on: May 30, 2016, 07:31:25 AM »
Quote
OK, next question ... what do people use zero-page for, and how full does it get?

Mostly, addresses for indirect access, and high-usage variables.
I think there are < 16 bytes free if you use the system CD stuff.

Quote
What do you think about putting the C stack in zero-page (there are lots of advantages to the code)?
Why?

I think you are misunderstanding the zero-page. It's only 256 bytes.
It's the area of memory that can be reached with an address of $00xx  (well, $20xx on the PCE)

elmer

  • Hero Member
  • *****
  • Posts: 2153
Re: HuC questions.
« Reply #35 on: May 30, 2016, 08:48:21 AM »
Mostly, addresses for indirect access, and high-usage variables.

Yep, that's what it is best for ... but the question is, how many are statically allocated by the game code (i.e. not the "system" variables).


Quote
I think there are < 16 bytes free if you use the system CD stuff.

Not unless we're reading different manuals.

The System Card only uses from $20DC-$20FF, and you can make that even smaller if you don't use the "graphics" functions ... $20E6-$20FF.

So we've got 220 bytes to play with ($2000-$20DB), or 230 without the "graphics" functions.
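
For anyone who wants to check the arithmetic, here is a minimal sketch that counts the bytes in the ranges quoted above. The range boundaries come from the post; the function and constant names are made up for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bytes in an inclusive address range, e.g. $2000-$20DB. */
unsigned range_size(uint16_t first, uint16_t last)
{
    assert(last >= first);
    return (unsigned)(last - first) + 1u;
}

/* Boundaries as quoted in the thread (names are hypothetical). */
enum {
    ZP_FIRST         = 0x2000, /* first byte of zero-page on the PCE          */
    ZP_LAST          = 0x20FF, /* last byte of zero-page                      */
    SYSCARD_ZP_FULL  = 0x20DC, /* System Card area, graphics functions in use */
    SYSCARD_ZP_NOGFX = 0x20E6  /* System Card area, graphics functions unused */
};

unsigned zp_free_with_graphics(void)
{
    return range_size(ZP_FIRST, SYSCARD_ZP_FULL - 1);
}

unsigned zp_free_without_graphics(void)
{
    return range_size(ZP_FIRST, SYSCARD_ZP_NOGFX - 1);
}

unsigned zp_used_by_system_card(void)
{
    return range_size(SYSCARD_ZP_FULL, ZP_LAST);
}
```

With those boundaries the user area comes to 220 bytes, or 230 bytes if the graphics functions are not used.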


Quote
I think you are misunderstanding the zero-page. It's only 256 bytes.
It's the area of memory that can be reached with an address of $00xx  (well, $20xx on the PCE)

Yes, I know the limit well. I think that you are misunderstanding the benefits.

It's just a design tradeoff ... you choose to limit yourself to a very small C stack where you don't allocate large structs or arrays on the stack, but in return you get blindingly fast access.

The current method, where you can allocate large local variables on the stack, but all accesses use "(sp),y" ... makes expression-evaluation and parameter-passing incredibly slow, basically forcing you to make absolutely everything a "static" and avoid expressions as much as possible.

With a zero-page based stack you can take full advantage of the 65C02's instruction set and use "zp,x" (just look at the timings and the extra instructions).

When you match that with a compiler that knows when it can use byte operations instead of always using word operations, then you can make a lot of what it does pretty efficient (relatively speaking).

When you do that, there's little/no benefit to using "static" variables for most stuff.

BTW ... a lot of 6502 implementations of FORTH (such as fig-FORTH) use a zero-page data stack, and are quite happy to limit themselves to using 64-bytes, and they pass parameters and do evaluation on the stack, just like C.
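
As a rough illustration of the idea, here is a small C model of a zero-page data stack indexed by X, in the spirit of the FORTH implementations mentioned above. It is only a sketch: the array `zp`, the variable `xreg`, the 64-byte size and all function names are assumptions for illustration, and the comments show the kind of 65C02 instruction each step stands for rather than actual compiler output.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_TOP 0x40u            /* hypothetical 64-byte stack at zp offsets $00-$3F */

static uint8_t zp[256];            /* stands in for the zero page                 */
static uint8_t xreg = STACK_TOP;   /* stands in for X; the stack grows downwards  */

void push16(uint16_t v)
{
    assert(xreg >= 2);                     /* room for one more 16-bit cell */
    xreg = (uint8_t)(xreg - 2);            /* dex / dex                     */
    zp[xreg]     = (uint8_t)(v & 0xFFu);   /* sta <stack,x                  */
    zp[xreg + 1] = (uint8_t)(v >> 8);      /* sta <stack+1,x                */
}

uint16_t pop16(void)
{
    uint16_t v = (uint16_t)(zp[xreg] | (zp[xreg + 1] << 8));
    xreg = (uint8_t)(xreg + 2);            /* inx / inx */
    return v;
}

/* FORTH-style "+": replace the top two cells with their sum. */
void add16(void)
{
    uint16_t b = pop16();
    uint16_t a = pop16();
    push16((uint16_t)(a + b));
}

uint8_t stack_bytes_used(void)
{
    return (uint8_t)(STACK_TOP - xreg);
}
```

Every access is a plain indexed zero-page load or store, and allocating or dropping a cell is just a change to X.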

Once again, I'll give these links to David Wheeler's web pages on 6502 language design and implementation choices ...

http://www.dwheeler.com/6502/
http://www.dwheeler.com/6502/a-lang.txt

<EDIT>

BTW, CC65 gives you the information that you need to access stack-based local variables in inline assembly, which together with its C macros should allow for some pretty efficient ways of optimizing some of the rough-spots in the code-generation and approach hand-assembly speed.
« Last Edit: May 30, 2016, 09:35:07 AM by elmer »

TailChao

  • Full Member
  • ***
  • Posts: 156
Re: HuC questions.
« Reply #36 on: May 30, 2016, 01:47:47 PM »
The current method, where you can allocate large local variables on the stack, but all accesses use "(sp),y" ... makes expression-evaluation and parameter-passing incredibly slow, basically forcing you to make absolutely everything a "static" and avoid expressions as much as possible.
Giving up a chunk of the ZeroPage for a zp,X software stack is not a huge loss. I think that's livable considering the speed improvements over (zp),Y or (zp).

But I think the real question is what people want to use C for on this platform, and how they want to write it.

Making everything static is really the only way to get good performance on the 65x family outside of the 65816, especially for your object system - statically allocated arrays of individual attributes.

Right when you bring any requirement for address + displacement into the equation, performance drops on the 6502. The problem is that many of C's great conveniences depend upon it. If you're stuck writing restricted C in order to cater to the shortcomings of the architecture then (personally) I don't see the benefit over just writing the assembly.

A compiler that knows to split a statically allocated array of structs into a struct of arrays, then further split each element larger than a byte into individual byte arrays, then access everything that way would be pretty cool (maybe something does this already?). I think this is really the biggest performance gain area - but it's also so contrary to C in general.
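
To make that transformation concrete, here is a hand-written sketch of what such a split would look like in C. The object fields, array names and helper functions are invented for illustration; no existing compiler is claimed to produce this layout automatically.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_OBJECTS 16

/* The "natural" C layout: element i lives at base + i * sizeof(struct object),
   so every field access needs address + displacement arithmetic. */
struct object {
    uint16_t x;
    uint16_t y;
    uint8_t  state;
};
struct object objects_aos[MAX_OBJECTS];

/* The split layout: one byte array per attribute byte, all indexed by the
   object number, which maps directly onto "lda table,x" style addressing. */
uint8_t obj_x_lo[MAX_OBJECTS], obj_x_hi[MAX_OBJECTS];
uint8_t obj_y_lo[MAX_OBJECTS], obj_y_hi[MAX_OBJECTS];
uint8_t obj_state[MAX_OBJECTS];

void obj_set_x(uint8_t i, uint16_t x)
{
    assert(i < MAX_OBJECTS);
    obj_x_lo[i] = (uint8_t)(x & 0xFFu);
    obj_x_hi[i] = (uint8_t)(x >> 8);
}

uint16_t obj_get_x(uint8_t i)
{
    assert(i < MAX_OBJECTS);
    return (uint16_t)(obj_x_lo[i] | (obj_x_hi[i] << 8));
}

/* 16-bit add done bytewise, the way it would be with clc/adc on each half. */
void obj_move_right(uint8_t i, uint8_t dx)
{
    unsigned sum;
    assert(i < MAX_OBJECTS);
    sum = (unsigned)obj_x_lo[i] + dx;
    obj_x_lo[i] = (uint8_t)sum;
    obj_x_hi[i] = (uint8_t)(obj_x_hi[i] + (sum >> 8));   /* add the carry */
}
```

The price is that `&objects[i]` no longer exists as a single pointer, which is exactly the "contrary to C" part.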

dshadoff

  • Full Member
  • ***
  • Posts: 175
Re: HuC questions.
« Reply #37 on: May 30, 2016, 01:49:17 PM »
Do you know why HuC includes a dummy .dw between each piece of included data?

I'm not 100% sure whether I'm clear on what you're asking, but it could be to force 16-bit alignment on 16-bit word data.  At least, I seem to recall there was something like that.

-Dave

dshadoff

  • Full Member
  • ***
  • Posts: 175
Re: HuC questions.
« Reply #38 on: May 30, 2016, 02:12:46 PM »
Quote
OK, next question ... what do people use zero-page for, and how full does it get?

Mostly, addresses for indirect access, and high-usage variables.
I think there are < 16 bytes free if you use the system CD stuff.

Well, it's not as cramped as that, but the system card does allocate from the bottom up, and the top down.

I use it for iterators, array-base address values, 16-bit pointers, as a drop zone for passing variables to common functions (just like _bx and so on from the system card), global variables, and for very temporary storage of registers, where others might just use the stack (stack is too slow).

Quote
What do you think about putting the C stack in zero-page (there are lots of advantages to the code)?

I personally wouldn't do that.
I would also try to avoid making my functions call other functions much, and avoid passing any sort of parameters in the 'C' stack, because stack-frame accounting itself wastes time and energy.  Just like local object instantiation does in C++ (instantiation often = waste).

One thing HuC does well, is to try to pass a single 8-bit or 16-bit value via registers.  Creating and dropping the stack frame is serious wasted effort on a machine of these capabilities.

One thing that a 'C' compiler - through its mere existence - does, is to lull people into a false sense that programming habits on one machine will translate well to another machine.  So, I would anticipate people passing 4 int variables in a function call.  I would anticipate 8-deep call levels.  And so I would anticipate corruption of variables due to exhausting all memory.  The target code would fail without warning (because who's going to put bounds checks in there ?), and the user would blame the compiler for his problems.

-Dave

TheOldMan

  • Hero Member
  • *****
  • Posts: 958
Re: HuC questions.
« Reply #39 on: May 30, 2016, 02:47:51 PM »
Quote
Well, it's not as cramped as that, but the system card does allocate from the bottom up, and the top down.
Okay. I just checked and there's an area between  $90 and $DC that's not being used, afaik.
So yeah, maybe not too cramped for a stack area.

Quote
So, I would anticipate people passing 4 int variables in a function call.  I would anticipate 8-deep call levels.  And so I would anticipate corruption of variables due to exhausting all memory.  The target code would fail without warning (because who's going to put bounds checks in there ?), and the user would blame the compiler for his problems
+1. 8-deep call levels are not that unusual; 4 ints as parameters isn't either (consider Rovers 'example', where there are several variables, for setting up a sprite).

Quote
Not unless we're reading different manuals.

I'm not reading a manual. I'm looking at the system card code.
Granted, you probably could use most of the zero page for a stack...but you would lose access to the cd, since a lot of cd-related variables are stored from the bottom upwards (ie, $00+)
For example, you couldn't play a cd audio track, since the TOC information is loaded down there....

Quote
With a zero-page based stack you can take full advantage of the 65C02's instruction set and use "zp,x" (just look at the timings and the extra instructions).

or, given X is an offset, you could generate labels for the entire stack area, and access values as
'lda   <stk06'. No indirection needed. Right?
Yes, I know that's not workable in reality.

What I personally think would be useful is to blend the current C stack and the ZP area stack.
The ZP stack could hold the address of a parameter block. Since the parameters would be consecutive, you could place the base address (stack,x) into a temp, then use [temp],x to access them. Not as fast, but not as limiting either.
(just an idea )


I still think a good advanced (ie, not peephole) optimization program could do wonders for even the lousy code HuC generates....

TheOldMan

  • Hero Member
  • *****
  • Posts: 958
Re: HuC questions.
« Reply #40 on: May 30, 2016, 03:02:37 PM »
Quote
Quote
    Do you know why HuC includes a dummy .dw between each piece of included data?

I'm not 100% sure whether I'm clear on what you're asking, but it could be to force 16-bit alignment on 16-bit word data.  At least, I seem to recall there was something like that.

Oddly enough, I think it's actually the way HuC parses declarations and generates code.
With  a definition like " char x", it generates a 0 value.
Now change that to  "char x[10]", and it generates a 0 value. And then generates 10 more 0 values as space for the array.

Just for fun, look at the code you get for 'const char x[10] = "abcdefghij";'
IIRC, you get a 0 value, followed by 10 more 0 values, followed by the actual letters...
Though, I could be wrong. It's been a long time since I looked at that stuff.

elmer

  • Hero Member
  • *****
  • Posts: 2153
Re: HuC questions.
« Reply #41 on: May 30, 2016, 04:15:13 PM »
Well, it's not as cramped as that, but the system card does allocate from the bottom up, and the top down.
Okay. I just checked and there's an area between  $90 and $DC that's not being used, afaik.
So yeah, maybe not too cramped for a stack area.
...
I'm not reading a manual. I'm looking at the system card code.
Granted, you probably could use most of the zero page for a stack...but you would lose access to the cd, since a lot of cd-related variables are stored from the bottom upwards (ie, $00+)
For example, you couldn't play a cd audio track, since the TOC information is loaded down there....

??? OK guys, you're scaring me here ... am I missing something crucial, or are we talking about different things?  :shock:

ZP  is $2000-$20FF. The Hu7 CD manual clearly documents that $2000-$20DB are User Area (i.e. free for use).

RAM is $2200-$3FFF. The Hu7 CD manual clearly documents that $2680-$3FFF are User Area (i.e. free for use).

AFAIK, any other usage of ZP that you're currently seeing is something to do with HuC, and not the CD System Card.

Am I missing something?  :pray:


I would also try to avoid making my functions call other functions much, and avoid passing any sort of parameters in the 'C' stack, because stack-frame accounting itself wastes time and energy.  Just like local object instantiation does in C++ (instantiation often = waste).

One thing HuC does well, is to try to pass a single 8-bit or 16-bit value via registers.  Creating and dropping the stack frame is serious wasted effort on a machine of these capabilities.

Yep, CC65 also passes the last parameter in registers rather than on the stack.

And "yes", any stack handling is slower than none at all ... but when the stack is in ZP, then accessing it is just as fast as the fastest static variable, and stack handling becomes just "dex" ... which is about as fast as you can get.

It's like having the compiler automatically create ZP static variables for you without you having to think about it.


One thing that a 'C' compiler - through its mere existence - does, is to lull people into a false sense that programming habits on one machine will translate well to another machine.  So, I would anticipate people passing 4 int variables in a function call.  I would anticipate 8-deep call levels.  And so I would anticipate corruption of variables due to exhausting all memory.  The target code would fail without warning (because who's going to put bounds checks in there ?), and the user would blame the compiler for his problems.

Well, 8 levels deep with 4 ints per level is 64 bytes. Well within a 128 byte stack.

There's no reason that there would be no warning. Stack checking on a PCE, if enabled, could be as simple as a "dex; bmi overflow".
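
Here is a small C model of that check, assuming a 128-byte stack where X runs from $7F down to $00 and an extra "dex" wraps to $FF. The variable `xreg` and the function names are invented for illustration; the return value plays the role of the "bmi overflow" branch not being taken.

```c
#include <stdbool.h>
#include <stdint.h>

static uint8_t xreg = 0x80;      /* stands in for X; $80 means "stack empty" */

/* Reserve one byte of stack.
   Returns true if the byte was available, false if the stack overflowed. */
bool stack_reserve_byte(void)
{
    xreg = (uint8_t)(xreg - 1);  /* dex: wraps from $00 to $FF              */
    return (xreg & 0x80u) == 0;  /* bit 7 is the N flag that bmi would test */
}

uint8_t stack_bytes_free(void)
{
    /* Only meaningful while no overflow has happened. */
    return xreg;
}
```

The first 128 reservations succeed and the 129th reports overflow, with no comparison instruction needed because the sign bit does the work.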

It would also be trivial to have an emulator, like Mednafen, specifically watch for a stack overflow (even without the overhead of an embedded "bmi overflow"), and break the program.

That's just a debugging improvement. Like adding symbol support to Mednafen, and even code profiling. None of those are particularly difficult.

And speaking of stack overflows without warning ... does HuC support stack checking? I can't see an option for it on the HuC command line.  :-k


+1. 8-deep call levels are not that unusual; 4 ints as parameters isn't either (consider Rovers 'example', where there are several variables, for setting up a sprite).

If you're doing that in HuC, then you're generating some pretty slow and ugly code ... unless everything is already declared as a static.


or, given X is an offset, you could generate labels for the entire stack area, and access values as
'lda   <stk06'. No indirection needed. Right?
Yes, I know that's not workable in reality.

If you're talking about accessing local variables without indirection, then "yes", that's what I'm already implementing in CC65, and it's easy because the compiler already knows the offset of any local variable relative to the current stack pointer.

So every local variable in a function is just "stack+offset,x" ... fast.

If you're talking about the step beyond that, where the compiler/linker actually analyzes the code at link time and gives every single parameter and local variable a static location in memory ... then that's also workable. SDCC implements that strategy.

Unfortunately, I don't think that I'm ready to add complete 65C02 processor support to SDCC, its assembler and its linker.


Quote
What I personally think would be useful is to blend the current C stack and the ZP area stack.
The ZP stack could hold the address of a parameter block. Since the parameters would be consecutive, you could place the base address (stack,x) into a temp, then use [temp],x to access them. Not as fast, but not as limiting either.
(just an idea )

I take it that you really mean "[temp],y" to access them. I guess that I'm missing something again. How is that an improvement over the current HuC "[stack],y"?


Quote
I still think a good advanced (ie, not peephole) optimization program could do wonders for even the lousy code HuC generates....

Improving either HuC or CC65's actual internal optimization would be great, but is beyond my interest level.

If someone really wants an optimized C on the 65C02, then I suggest that they look at getting SDCC processor support implemented ... it's not supposed to be totally horrible to do. That way you'd get a real modern C compiler with all the expected optimizations (like constant propagation, loop-invariant code motion, dead-code elimination, etc, etc).

Perhaps that could be Bonknuts' Degree/Masters project!  :wink:

dshadoff

  • Full Member
  • ***
  • Posts: 175
Re: HuC questions.
« Reply #42 on: May 30, 2016, 04:56:38 PM »
Well, it's not as cramped as that, but the system card does allocate from the bottom up, and the top down.
Okay. I just checked and there's an area between  $90 and $DC that's not being used, afaik.
So yeah, maybe not too cramped for a stack area.
...
I'm not reading a manual. I'm looking at the system card code.
Granted, you probably could use most of the zero page for a stack...but you would lose access to the cd, since a lot of cd-related variables are stored from the bottom upwards (ie, $00+)
For example, you couldn't play a cd audio track, since the TOC information is loaded down there....

??? OK guys, you're scaring me here ... am I missing something crucial, or are we talking about different things?  :shock:

ZP  is $2000-$20FF. The Hu7 CD manual clearly documents that $2000-$20DB are User Area (i.e. free for use).

Hmm... you may be right on this (after I checked a couple of pieces of actual code).

One thing that a 'C' compiler - through its mere existence - does, is to lull people into a false sense that programming habits on one machine will translate well to another machine.  So, I would anticipate people passing 4 int variables in a function call.  I would anticipate 8-deep call levels.  And so I would anticipate corruption of variables due to exhausting all memory.  The target code would fail without warning (because who's going to put bounds checks in there ?), and the user would blame the compiler for his problems.

Well, 8 levels deep with 4 ints per level is 64 bytes. Well within a 128 byte stack.

There's no reason that there would be no warning. Stack checking on a PCE, if enabled, could be as simple as a "dex; bmi overflow".

Wait a second.

First, I wouldn't want a compiler to tell me that I can no longer write hand-coded assembly which accesses zero page.

Second, don't forget that the stack frame is not used only for parameter passing; it's also used for local variables in a standard C compiler.  So, if somebody decides to have 15 local int variables (not unlikely), that's 30 of your 200 bytes in just one call level.  If somebody wants to allocate a local array or struct, it could be completely gone.
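
The numbers being traded back and forth here are easy to sketch. The helper below assumes 16-bit ints (as in HuC and CC65), counts only int parameters and int locals, and ignores return addresses and compiler temporaries; the function names are made up for illustration.

```c
#include <assert.h>

enum { INT_BYTES = 2 };   /* assumption: int is 16 bits */

/* Bytes of C stack taken by one call level. */
unsigned frame_bytes(unsigned int_params, unsigned int_locals)
{
    return (int_params + int_locals) * INT_BYTES;
}

/* How many identical call levels fit into a stack of the given size. */
unsigned max_depth(unsigned stack_bytes, unsigned bytes_per_frame)
{
    assert(bytes_per_frame > 0);
    return stack_bytes / bytes_per_frame;
}
```

Under those assumptions, 8 levels with 4 int parameters each come to 64 bytes, 15 int locals alone take 30 bytes, and a function with both (38 bytes per level) fits only 5 levels deep into 200 bytes.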

By the way, this is why I have said repeatedly in the past that globals are the way to go for variables in HuC, as they are given a specific address and are accessed with absolute addressing mode (many times faster than stack).  In fact, I would even like the opportunity to selectively promote some of these globals to ZP for faster direct access.

Dave

TheOldMan

  • Hero Member
  • *****
  • Posts: 958
Re: HuC questions.
« Reply #43 on: May 30, 2016, 04:59:05 PM »
Quote
OK guys, you're scaring me here ... am I missing something crucial, or are we talking about different things?

I don't think we're talking different things.....
The CD system BIOS uses the zp area, which is supposedly unused, to store parameters about the current CD. Like the TOC information. Current audio playing position. Etc.
It is interesting to note that $3a in the zp area is used in the standard timer irq routine, as an "I'm already handling this..." flag. (At least, that's what I think it is.)

The variables you are looking at in the zp are the ones common between the CD system and the stock routines for cards (I think). I believe a lot of the CD BIOS routines were also available as either source code or a standard library for making cards.

Quote
Well, 8 levels deep with 4 ints per level is 64 bytes. Well within a 128 byte stack.

I guess I just don't get the point of using a zp stack area.
If it's going to be limited to 128 bytes, can't you do that on the system stack?
Why do it using semi-valuable zp space, which can be used for pointers, high-speed counters, general registers, etc?

Quote
If you're doing that in HuC, then you're generating some pretty slow and ugly code ... unless everything is already declared as a static.
Slow ugly code wasn't a problem while we were developing it, but yes, things got moved to static (ie ram) variables as part of the optimization process :)

Quote
I take it that you really mean "[temp],y" to access them. I guess that I'm missing something again. How is that an improvement over the current HuC "[stack],y"?

No, I really meant temp,x ... but maybe I didn't explain it clearly. The thought was to do it the same way most tia/tai/tii etc. instructions are done: set up a small routine, with the address as a variable. Then you could call the routine to get the value. I realize it's probably not faster, but it's doable. Hey, not all my ideas are -good- ones :)

Quote
Improving either HuC or CC65's actual internal optimization would be great, but is beyond my interest level.
No, not the internal optimization. A separate optimizer program that goes through the code (from either HuC or CC65) and rearranges/rewrites it in a more optimized form, which you then assemble and/or link.
I still think it would be easier to do, and give you better optimized code.

TheOldMan

  • Hero Member
  • *****
  • Posts: 958
Re: HuC questions.
« Reply #44 on: May 30, 2016, 05:22:03 PM »
<edit>
Okay, you may be right.

I guess my disassembler has a problem, as it's thinking $2200 is the zp area.
that is,  I get lda    <cdTocBuf+1 .... but the opcode shows  a9 22......<sigh>
That's gonna set me back a bit....

And that timer value is shown as inc  <$36.... with a data value of $e6.
Carry. on.
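
For anyone following along with their own disassembler, here is a tiny sketch of the distinction that went wrong above. On the 65C02 family, opcode $A9 is LDA immediate (so "a9 22" is lda #$22, not a zero-page load), while $A5 is LDA zero-page and $E6 is INC zero-page. The enum and function names are invented for illustration, and the $2000 base assumes the usual PCE mapping of zero-page mentioned earlier in the thread.

```c
#include <stdint.h>

typedef enum { MODE_IMMEDIATE, MODE_ZERO_PAGE, MODE_UNKNOWN } addr_mode;

/* Addressing mode for the few opcodes discussed here; a real disassembler
   would of course need the complete opcode table. */
addr_mode operand_mode(uint8_t opcode)
{
    switch (opcode) {
    case 0xA9: return MODE_IMMEDIATE;   /* LDA #imm */
    case 0xA5: return MODE_ZERO_PAGE;   /* LDA zp   */
    case 0xE6: return MODE_ZERO_PAGE;   /* INC zp   */
    default:   return MODE_UNKNOWN;
    }
}

/* CPU address that a zero-page operand refers to on the PCE ($20xx). */
uint16_t zp_effective(uint8_t operand)
{
    return (uint16_t)(0x2000u + operand);
}
```

Only operands of zero-page instructions should be labelled as zp variables; an immediate $22 is just the high byte of an address in RAM at $22xx.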