Author Topic: HuC questions. (Read 2506 times)

elmer · « **Reply #45 on:** May 30, 2016, 05:31:58 PM »

Dave, I don't see that we're actually arguing from a hugely different viewpoint here.

Perhaps I'm willing to consider tailoring my C code to the platform a little more than you are.

But ... if I'm even going to consider using C at all, then I'm unlikely to follow Arkhan's example of hand-editing the compiler's output to make it suck less.

At the moment, I'm just editing CC65 because it's an easy target to improve.

Quote from: dshadoff on May 30, 2016, 04:56:38 PM

Second, don't forget that the stack frame is not used only for parameter passing; it's also used for local variables in a standard C compiler. So, if somebody decides to have 15 local int variables (not unlikely), that's 30 of your 200 bytes in just one call level. If somebody wants to allocate a local array or struct, it could be completely gone.

For a start ... I would disallow any local arrays or structs on the stack.

That's a nasty 1st-pass solution ... the 2nd pass "fix" would be to allocate them dynamically in memory ... on a stack.

Then they'd be accessed just as slowly as they currently are in HuC!

Bad code in gives bad code out. I see no practical difference in the methods.

The idea is to optimize what can be sensibly optimized, and then to try not to break too much else.

Quote from: dshadoff on May 30, 2016, 04:56:38 PM

Wait a second.

First, I wouldn't want a compiler to tell me that I can no longer write hand-coded assembly which accesses zero page.
...
By the way, this is why I have said repeatedly in the past that globals are the way to go for variables in HuC, as they are given a specific address and are accessed with absolute addressing mode (many times faster than stack). In fact, I would even like the opportunity to selectively promote some of these globals to ZP for faster direct access.

In the scheme that I'm proposing, you're still left with 48+ bytes of space to use however you wish.

If you're willing to juggle the overlapping usage of more than a few dozen static ZP variables in your head, and you're going to use globals for speed instead of using the stack, then you can just reduce the size of the data stack, and get yourself more free space for your static variables.

Just remember ... the cost/benefit performance difference for some of your "global" optimizations would be radically different with this ZP-stack.

BTW ... you may not realize this, but CC65 only allocates 1 byte of stack space for "char" variables ... so if you're using them extensively for speed (as you should be), then 128 bytes can give you a significant amount of variables.
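To put rough numbers on that, here is a minimal back-of-envelope sketch in C. It assumes target sizes of 1 byte per char and 2 bytes per int (which matches the "15 ints = 30 bytes" figure quoted above); the function names are made up for illustration and are not part of CC65 or HuC.

```c
/* Assumed target sizes (not host sizes): 1-byte char, 2-byte int. */
enum { TARGET_CHAR_BYTES = 1, TARGET_INT_BYTES = 2 };

/* Bytes of stack that one call level needs for its local variables. */
unsigned frame_bytes(unsigned n_chars, unsigned n_ints)
{
    return n_chars * TARGET_CHAR_BYTES + n_ints * TARGET_INT_BYTES;
}

/* How many nested call levels of that size fit in the stack.
   Returns 0 for an empty frame to avoid dividing by zero. */
unsigned max_call_depth(unsigned stack_bytes, unsigned frame)
{
    return frame == 0 ? 0 : stack_bytes / frame;
}
```

Under those assumptions, 15 int locals cost 30 bytes per call level while 15 char locals cost 15, so a 128-byte stack holds 4 such levels in the first case and 8 in the second.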

elmer · « **Reply #46 on:** May 30, 2016, 05:51:13 PM »

Quote from: TheOldMan on May 30, 2016, 04:59:05 PM

I guess I just don't get the point of using a zp stack area.
If it's going to be limited to 128 bytes, can't you do that on the system stack?
Why do it using semi-valuable zp space, which can be used for pointers, high-speed counters, general registers, etc?

You can't use the hardware stack because the 6502 series didn't get stack-relative addressing until the WDC65816.

Anyway ... it's actually useful (in practice) to have the hardware stack available for temporary storage (a push and a pull are 1 cycle faster than a ZP save/load).

CC65's "register" variables will be pushed onto the hardware stack so that you've got fixed ZP locations for pointers. Slower than using a static variable (which you can still choose to do), but faster than "dynamic" pointers (in either CC65 or HuC).

One of the interesting things about putting locals on a ZP stack is that you can do a no-cost local-variable pointer access with "lda (stack+offset,x)".

Sometimes (but only sometimes), that would be just as useful as having the pointer in a static variable.
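For anyone not familiar with that addressing mode, here is a small host-side model of what "lda (base,x)" does on a classic 6502. It is only a sketch: the arrays and the function name are hypothetical, and keeping zero page in its own array is a simplification for illustration.

```c
#include <stdint.h>

static uint8_t zero_page[256];   /* model of the zero page */
static uint8_t memory[65536];    /* model of the 64KB address space */

/* Model of "lda (base,x)": the pointer itself lives in zero page at
   base + X, and the load goes through that pointer. */
uint8_t load_indexed_indirect(uint8_t base, uint8_t x)
{
    uint8_t  slot = (uint8_t)(base + x);              /* wraps within zero page */
    uint8_t  lo   = zero_page[slot];                  /* pointer low byte */
    uint8_t  hi   = zero_page[(uint8_t)(slot + 1)];   /* pointer high byte */
    uint16_t addr = (uint16_t)(lo | ((uint16_t)hi << 8));

    return memory[addr];
}
```

If X holds the ZP stack pointer and "base" is the offset of a local pointer variable, the pointer can be dereferenced where it sits, without first copying it to a fixed ZP location.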

Quote

Slow ugly code wasn't a problem while we were developing it, but yes, things got moved to static (ie ram) variables as part of the optimization process

Part of the idea is to make the generated code suck less so that less "optimization" time is required.

Quote

No, not the internal optimization. A separate optimizer program that goes through the code (from either HuC or CC65) and rearranges/rewrites it in a more optimized form, which you then assemble and/or link.
I still think it would be easier to do, and give you better optimized code.

Perhaps that would work ... but by that stage you've thrown away so much information about the intent of the code that I'd be surprised if the analysis that you'd have to do would be any easier than just doing more optimization inside the compiler itself.

aurbina · « **Reply #47 on:** May 30, 2016, 06:27:36 PM »

Quote from: elmer on May 26, 2016, 06:42:28 AM

Quote from: aurbina on May 26, 2016, 05:24:33 AM
Well, I use the toolchain under windows, and I believe Ulrich did as well. Using MinGW and MSYS http://www.mingw.org/wiki/msys

Hmmm ... that's weird!

I abandoned the original mingw/msys project a few years ago because it was getting so old and out-of-date.

I'm using the mingw-w64/msys2 combination instead which has been an absolute pleasure to work with after my experiences with mingw/msys.

https://sourceforge.net/projects/msys2/

This is the first time that I've heard of the old mingw having a feature that the new mingw-w64 is missing.

In this case, I can't compile Ulrich's HuC source because he's using "fmemopen", which the original HuC project didn't use.

It wouldn't be hard to rewrite the output code to use a different method instead, but I'm not at the point of wanting to do so, yet.

I was completely wrong, it doesn't use MinGW.

I had everything automated with scripts, so I'd forgotten. I just checked, and I compile using Cygwin. Just wanted to make that correction, sorry.

Arkhan · « **Reply #48 on:** May 30, 2016, 06:36:50 PM »

I don't hand alter the compiler output, lol. fuuuuuuuuck that. the generated output is all macro-oni and cheese looking.

After the game (Atlantean) is functional and the AI is how I want, I just convert the C code to asm by hand and leave it in #asm blocks inside of C function calls.

The majority of functions take in no arguments. all the variables are global.

So, I'm essentially just using C because it's much simpler to try and experiment with AI/gameplay mechanics in C.

as it turns out, most games are not that complicated. Converting whatever you've done into assembly after you're sure the C stuff is working right is not that complicated.

The perk to having it in C first is, now I have the code for if I want to go plop the bastard on a different platform. 6502 is braindamaged. Converting 6502 to z80 would make me want to shoot myself. Rebuilding C to z80 and re-writing where needed would be much less moronic.

A game like Atlantean suffers minimal slowdown in these instances. If it didn't scroll two ways, and didn't have to constantly track EVERY enemy (even off screen ones), it would *fly*.

AKA: I could turn the game into a competent horizontal shooter, simply.

elmer · « **Reply #49 on:** May 30, 2016, 07:06:10 PM »

Quote from: Arkhan on May 30, 2016, 06:36:50 PM

I don't hand alter the compiler output, lol. fuuuuuuuuck that. the generated output is all macro-oni and cheese looking.

After the game (Atlantean) is functional and the AI is how I want, I just convert the C code to asm by hand and leave it in #asm blocks inside of C function calls.

Ah, sorry, I thought that you'd got a macro-expanded version of the source and then fixed up the compiler-idiocies.

Yes, recoding from C into ASM makes a lot of sense.

I'm just trying to come up with a halfway-house solution where I could potentially do some coding in C for speed-of-development, and then not need to rewrite so much of it in ASM.

Quote

The majority of functions take in no arguments. all the variables are global.

So, I'm essentially just using C because it's much simpler to try and experiment with AI/gameplay mechanics in C.

Yes, so you're already tailoring your code so that it matches the architecture capabilities of 8-bit CPUs, and you're using C for its speed of prototyping and its ability to simplify some of the tiresome "grunt-work" while you're still putting the game together.
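As a rough illustration of that style (my guess at its general shape, not Arkhan's actual code), the sketch below keeps everything in globals and passes no arguments. All of the names and the array size are hypothetical.

```c
#include <stdint.h>

#define MAX_ENEMIES 8               /* hypothetical table size */

/* Everything is global, so each variable has a fixed address. */
uint8_t enemy_x[MAX_ENEMIES];
uint8_t enemy_dx[MAX_ENEMIES];
uint8_t enemy_index;                /* global loop counter, not a local */

/* No parameters and no locals: nothing needs to go on the C stack. */
void move_enemies(void)
{
    for (enemy_index = 0; enemy_index < MAX_ENEMIES; ++enemy_index) {
        enemy_x[enemy_index] += enemy_dx[enemy_index];
    }
}
```

It is still plain ANSI C, so it should rebuild on another target, but it reads much more like structured assembly than like typical C.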

That sounds like exactly the position that I'm trying to see if I can get to (in a reasonable amount of time).

We just seem to have different expectations of what the "minimal" level of the compiler-generated runtime performance is.

Quote

Rebuilding C to z80 and re-writing where needed would be much less moronic.

Which is another reason to have your input code look as much like standard ANSI C as possible.

touko · « **Reply #50 on:** May 30, 2016, 08:28:04 PM »

Quote from: dshadoff on May 30, 2016, 01:49:17 PM

Quote from: touko on May 30, 2016, 12:27:56 AM
Do you know why HuC includes a dummy .dw between each piece of included data?

I'm not 100% sure whether I'm clear on what you're asking, but it could be to force 16-bit alignment on 16-bit word data. At least, I seem to recall there was something like that.

-Dave

OK, thanks Dave. It's a little bit annoying when you could transfer multiple pieces of data at once (in ASM), but you can't because of that.

elmer · « **Reply #51 on:** June 03, 2016, 05:14:09 AM »

Quote from: TailChao on May 30, 2016, 01:47:47 PM

Giving up a chunk of the ZeroPage for a (zp,X) software stack is not a huge loss. I think that's livable considering the speed improvements over (zp),Y or (zp).

But I think the real question is what people want to use C for on this platform, and how they want to write it.

Making everything static is really the only way to get good performance on the 65x family outside of the 65816, especially for your object system - statically allocated arrays of individual attributes.

Right when you bring any requirement for address + displacement into the equation, performance drops on the 6502. The problem is that many of C's great conveniences depend upon it. If you're stuck writing restricted C in order to cater to the shortcomings of the architecture then (personally) I don't see the benefit over just writing the assembly.

Quote from: dshadoff on May 30, 2016, 04:56:38 PM

First, I wouldn't want a compiler to tell me that I can no longer write hand-coded assembly which accesses zero page.

Second, don't forget that the stack frame is not used only for parameter passing; it's also used for local variables in a standard C compiler. So, if somebody decides to have 15 local int variables (not unlikely), that's 30 of your 200 bytes in just one call level. If somebody wants to allocate a local array or struct, it could be completely gone.

By the way, this is why I have said repeatedly in the past that globals are the way to go for variables in HuC, as they are given a specific address and are accessed with absolute addressing mode (many times faster than stack). In fact, I would even like the opportunity to selectively promote some of these globals to ZP for faster direct access.

Hmmmm ... the more that I think about this and actually mangle CC65's source code, the more that I'm coming to the conclusion that I need to step back for a while and rethink this.

As I look at the code, and get past the idea of how much faster that one addressing mode "zp,x" is than "(zp),y" ... I'm thinking more about the actual usage of the stack, and I can see that you're both looking at things from a more experienced and superior perspective.

There's absolutely no way that I'm going to make stack-based access a sensible alternative to static and global variables, and the limitations that I'm imposing with a permanent stack pointer in the X register, and with requiring the use of so much zero-page memory, are both too much of a cost for the benefits that they might provide.

Quote from: TailChao on May 30, 2016, 01:47:47 PM

A compiler that knows to split a statically allocated array of structs into a struct of arrays, then further split each element larger than a byte into individual byte arrays, then access everything that way would be pretty cool (maybe something does this already?). I think this is really the biggest performance gain area - but it's also so contrary to C in general.

Yes, that would be lovely ... but, as you say, it's not really C anymore if the compiler is going to do that.
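To make the transformation concrete, here is a hand-written sketch of what such a compiler would have to produce. The struct, its fields, and the accessor names are all hypothetical. The point is that once each byte has its own array, an 8-bit object index can be used directly as an index register, with no multiply by the struct size.

```c
#include <stdint.h>

#define NUM_OBJECTS 16              /* hypothetical object count */

/* What you would normally write in C: an array of structs. */
struct object { uint16_t x; uint8_t hp; };
struct object objects_aos[NUM_OBJECTS];

/* The split form: one byte array per byte of each field. */
uint8_t obj_x_lo[NUM_OBJECTS];
uint8_t obj_x_hi[NUM_OBJECTS];
uint8_t obj_hp[NUM_OBJECTS];

/* Reassemble the 16-bit x coordinate from its two byte arrays. */
uint16_t obj_get_x(uint8_t i)
{
    return (uint16_t)(obj_x_lo[i] | ((uint16_t)obj_x_hi[i] << 8));
}

/* Store a 16-bit x coordinate as separate low and high bytes. */
void obj_set_x(uint8_t i, uint16_t x)
{
    obj_x_lo[i] = (uint8_t)(x & 0xFF);
    obj_x_hi[i] = (uint8_t)(x >> 8);
}
```

Doing this by hand is the usual workaround today, which is exactly why it stops looking like "normal" C.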

I think that with the limits of the 65xx, we're really looking at C as more of a semi-familiar structured-assembler.

Trying to write anything that looks like "normal" C code is just going to lead to terrible frustration.

elmer · « **Reply #52 on:** June 04, 2016, 06:26:37 AM »

Quote from: Arkhan on May 30, 2016, 06:36:50 PM

The perk to having it in C first is, now I have the code for if I want to go plop the bastard on a different platform. 6502 is braindamaged. Converting 6502 to z80 would make me want to shoot myself. Rebuilding C to z80 and re-writing where needed would be much less moronic.

Quick question ... what C compiler are you using on the Z80?

Bonknuts · « **Reply #53 on:** June 06, 2016, 11:15:23 AM »

If you decide on CC65, you might want to look into a 6502 plugin for Eclipse. Would be nice to modify it for 6280.

Bonknuts · « **Reply #54 on:** June 06, 2016, 11:26:40 AM »

Also, about this stack optimization stuff: instead of using ZP, why not have a three or four stack system? As in, each stack is only 256 bytes (because you're indexing it directly), but the compiler could assign at compile time which stack each function uses. And in the case of nesting of the same function, there could be 2 or 3 versions which the compiler could choose between to keep the stack usage from going out of bounds.

ABS,y is only +1 cycle more than ZP,y. And you'd get away from the [stack],y mode or worse manually building the offset to the stack each time (not sure if HuC does this or not).

elmer · « **Reply #55 on:** June 07, 2016, 07:44:42 AM »

Quote from: Bonknuts on June 06, 2016, 11:15:23 AM

If you decide on CC65, you might want to look into a 6502 plugin for Eclipse. Would be nice to modify it for 6280.

Hahaha ... not Eclipse ... never Eclipse!

A 177MB download and fracking Java just for an editor ... not on my computer.

I'll stick with Zeus (http://www.zeusedit.com/index.html), and sometimes the free PSPad (http://www.pspad.com/en/).

Quote from: Bonknuts on June 06, 2016, 11:26:40 AM

Also, about this stack optimization stuff: instead of using ZP, why not have a three or four stack system? As in, each stack is only 256 bytes (because you're indexing it directly), but the compiler could assign at compile time which stack each function uses. And in the case of nesting of the same function, there could be 2 or 3 versions which the compiler could choose between to keep the stack usage from going out of bounds.

ABS,y is only +1 cycle more than ZP,y. And you'd get away from the [stack],y mode or worse manually building the offset to the stack each time (not sure if HuC does this or not).

Yes, I'd come to the same conclusion.

The nice thing about this is that the stack pointer can spend most of its time loaded into the Y register, and only gets kicked out when the Y register is needed to access something through a pointer. That's easy to manage in the peephole optimizer.

I'm part-way through implementing that in CC65, but it may just break things.

However, once you make the design choice to go that route, then it becomes sensible to think about removing all the C-stack pushes and pops within a function, and just calculate the stack space that a function needs and then allocate it all-at-once at the start of the function.

Again, that's something that could potentially be done during/after HuC or CC65's peephole optimizers.

Changing the frame layout would be better handled at the code-generation stage ... but that might be difficult to accomplish in either HuC or CC65.

If you can get a frame pointer that doesn't change during a function, and you use the "abs,y" addressing mode to access the stack, then stack-based variables are often as fast as using statically allocated variables.

IMHO, that could be a bit of a game-changer.
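Here is a host-side C model of the idea, just to show the shape of it. It is a sketch under my own assumptions and not what CC65 or HuC currently generate: the 256-byte array stands in for one software stack, "frame_ptr" stands in for the value kept in the Y register, and every name is hypothetical.

```c
#include <stdint.h>

static uint8_t soft_stack[256];   /* one 256-byte software stack */
static uint8_t frame_ptr = 0;     /* models Y; 0 means "empty", grows downward */

/* Reserve the whole frame in a single step on function entry. */
void frame_enter(uint8_t size) { frame_ptr = (uint8_t)(frame_ptr - size); }

/* Release the whole frame in a single step on function exit. */
void frame_leave(uint8_t size) { frame_ptr = (uint8_t)(frame_ptr + size); }

/* Fixed offsets of the locals within this function's frame. */
enum { LOCAL_SUM = 0, LOCAL_I = 1, FRAME_SIZE = 2 };

/* Sums 1..n (modulo 256) using only frame-relative locals. */
uint8_t sum_to(uint8_t n)
{
    uint8_t result;

    frame_enter(FRAME_SIZE);      /* no further pushes or pops after this */
    soft_stack[frame_ptr + LOCAL_SUM] = 0;
    soft_stack[frame_ptr + LOCAL_I]   = n;
    while (soft_stack[frame_ptr + LOCAL_I] != 0) {
        soft_stack[frame_ptr + LOCAL_SUM] += soft_stack[frame_ptr + LOCAL_I];
        --soft_stack[frame_ptr + LOCAL_I];
    }
    result = soft_stack[frame_ptr + LOCAL_SUM];
    frame_leave(FRAME_SIZE);
    return result;
}
```

Because "frame_ptr" never changes between entry and exit, each local is always at the same "base + offset, indexed by Y" location, which is what makes the "abs,y" access pattern possible.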

Anyway ... even more interesting than attempting to improve HuC or CC65 is the possibility of actually getting SDCC to support the 6502.

That's just a much superior foundation to build upon than the Small-C roots of both HuC and CC65.

I've got one of the SDCC developers showing some signs of interest in working on a 65C02 code generator for SDCC, and I'll see what I can do to help that process and to try to keep his interest alive.

DarkKobold · « **Reply #56 on:** June 08, 2016, 08:16:22 AM »

I have a quick question - would the goal be to be able to port HuC code directly to CC65 or SDCC? Or would someone better at this than me actually have to rewrite all of the Turbo-related functions for these compilers?

elmer · « **Reply #57 on:** June 08, 2016, 03:34:57 PM »

Quote from: DarkKobold on June 08, 2016, 08:16:22 AM

I have a quick question - would the goal be to be able to port HuC code directly to CC65 or SDCC? Or would someone better at this than me actually have to rewrite all of the Turbo-related functions for these compilers?

Personally, I don't have enough time/energy invested in HuC to worry too much about compatibility, nor do I have enough knowledge of HuC's quirks to be the right person to try to shoehorn in HuC's way of doing things into a different environment.

I'm happy to consider trying not to break things when it doesn't have any effect upon the efficiency of the result ... but getting a "better" C compiler is my primary interest, not compatibility.

Now, if such a theoretical "better" C compiler can be made (which isn't at all certain), then there are definitely some other folks here that might be prompted/pushed into working on a HuC compatibility layer.

But from what I'm seeing ... it's up to me (or someone else that has a similar interest) to prove that something better is available before anyone else will take their time to become involved.

That's not overly surprising (but yet, still a little disheartening).

If you try to change things ... then sometimes, perhaps even often, you'll fail.

But if nobody ever even risks that failure, then things never improve for anyone.
