Author Topic: HuC questions.  (Read 2529 times)

TheOldMan

  • Hero Member
  • *****
  • Posts: 958
Re: HuC questions.
« Reply #15 on: May 25, 2016, 08:13:53 PM »
Quote
IMHO, it's not worth my time and energy to mess with one of the compilers...

Just out of curiosity, have you thought about adding a separate optimizer to the tool chain?
Something that could read the compiler output, look for things to optimize, and then output a new version of the compiler's output?

Quote
From what I'm seeing, all the math macros like compare, add, subtract, shift, multiply, etc are all hard-coded for 16-bit values in X:A.
Probably.  I think it would be hard enough to get right in a limited case, much less write a macro that handles all the possibilities. (But then, I'm not that good at asm. I have enough trouble with compares when I know the sizes <lol>)

Quote
That is what the last patch I made to HuC does, since I was working on the 240p test suite version for system card 1.0.

Someone should really gather all the patches together, and release an updated HuC.

Arkhan

  • Hero Member
  • *****
  • Posts: 14142
  • Fuck Elmer.
    • Incessant Negativity Software
Re: HuC questions.
« Reply #16 on: May 25, 2016, 09:00:38 PM »
The hard-16-bit code generation for comparisons is pretty much what caused slowdown in Atlantean.   I was rarely actually comparing 16-bit numbers.   Most of it is char based stuff.

On top of this, its use of the X register causes great collisions with arrays that also use the X register.

A simple
Code:
if(thing[i] > otherthing[i])
comparison is a hot mess.

To me it honestly makes the simplest of C-Like-Things pretty much unusable.

It would be nice if the two things didn't use the same index register. 

Since I ended up just hand optimizing and writing compares, I never looked too far into it.

Couldn't array access just use the Y register? 
[Fri 19:34]<nectarsis> been wanting to try that one for awhile now Ope
[Fri 19:33]<Opethian> l;ol huge dong

I'm a max level Forum Warrior.  I'm immortal.
If you're not ready to defend your claims, don't post em.

elmer

  • Hero Member
  • *****
  • Posts: 2153
Re: HuC questions.
« Reply #17 on: May 26, 2016, 03:46:51 AM »
Quote
That is what the last patch I made to HuC does, since I was working on the 240p test suite version for system card 1.0.

Hi Artemio, nice to see you here!  :)

I hadn't looked at your 240p test suite on GitHub and so it hadn't clicked that "aurbina" here is the same Artemio Urbina on GitHub with the HuC fork.

Thanks for your hard work on that.


Quote
Just out of curiosity, have you thought about adding a separate optimizer to the tool chain?
Something that could read the compiler output, look for things to optimize, and then output a new version of the compiler's output?

Err ... there's already a simple optimizer built into the HuC code.

It seems to be based upon optimizing the sequence of macros that get used rather than on an individual instruction level.

CC65's optimizer seems to be a more-traditional peephole optimizer at the instruction level.

I'm not sure which is the better approach, but I can certainly see that HuC's approach makes a lot of sense given the simplicity of the code generation, and it's a huge improvement over the original Small-C code that doesn't seem to include any optimization at all.

Have you looked at HuC's source to see about adding some improvements?

Are there particular sequences of macro output that bother you?


Quote
Probably.  I think it would be hard enough to get right in a limited case, much less write a macro that handles all the possibilities. (But then, I'm not that good at asm. I have enough trouble with compares when I know the sizes <lol>)

From a code-generation aspect, isn't it mainly a case of having a 2nd set of macros for byte operations?

Then you just have a new macro that zero-extends or sign-extends the 8-bit primary register into 16-bits whenever you do a 16-bit operation with it.

But I suspect that the bigger issue could be actually having to keep track of the size of all of the variables ... I've not dug into HuC deep enough to see if it's doing that.

CC65 already has all that stuff in place ... which is nice.
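
A minimal sketch of that widening idea in plain C, assuming nothing about how HuC or CC65 actually implement it. The helper names (zero_extend8, sign_extend8, less_than_u8, less_than_u8_u16) are made up for illustration. The point is that a byte-only compare needs no widening at all, and a mixed compare only needs the byte operand widened first.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers, not HuC or CC65 code: they model what a
   "widen the 8-bit primary register" macro would have to do. */

/* unsigned char -> 16-bit: the high byte is always $00 */
uint16_t zero_extend8(uint8_t a)
{
    return (uint16_t)a;
}

/* signed char -> 16-bit: the high byte is $FF when bit 7 is set, else $00 */
uint16_t sign_extend8(uint8_t a)
{
    uint8_t high = (a & 0x80) ? 0xFF : 0x00;
    return (uint16_t)(((uint16_t)high << 8) | a);
}

/* byte-only unsigned compare: no widening needed */
int less_than_u8(uint8_t lhs, uint8_t rhs)
{
    return lhs < rhs;
}

/* mixed compare: widen the byte operand, then do the 16-bit compare */
int less_than_u8_u16(uint8_t lhs, uint16_t rhs)
{
    return zero_extend8(lhs) < rhs;
}

/* usage: the two widening rules and a byte-only compare */
void check_extend(void)
{
    assert(zero_extend8(0x80) == 0x0080);
    assert(sign_extend8(0x80) == 0xFF80);
    assert(less_than_u8(3, 200));
}
```

The hard part mentioned above is not these helpers, it is the compiler knowing at each operation which of them (if any) applies.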


Quote
Someone should really gather all the patches together, and release an updated HuC.

That's the joy of modern development, Artemio already did gather all the important patches together ...

https://github.com/ArtemioUrbina/huc

Unfortunately, Ulrich's changes use a couple of linux functions that aren't available on Windows, and so it looks like HuC must now be compiled under cygwin and use the nasty cygwin dll on Windows.

I don't know if there's a pre-built version somewhere, perhaps on one of the other forums, or on someone's web page.

Fixing the code that he added to make it compile on Windows again would be a nice little project for a C programmer.


Quote
The hard-16-bit code generation for comparisons is pretty much what caused slowdown in Atlantean.   I was rarely actually comparing 16-bit numbers.   Most of it is char based stuff.

On top of this, its use of the X register causes great collisions with arrays that also use the X register.

A simple
Code:
if(thing[i] > otherthing[i])
comparison is a hot mess.

Are those arrays global, static or local (i.e. stack-based)?

Are they arrays of 8-bit values or 16-bit values or structs?

Since I'm not really familiar with the code that HuC generates, it would be really helpful to have an example to see what it's doing.

Could you send me a ".s" file of the compiler output so that I can see the problem in a real program?


Quote
Couldn't array access just use the Y register?

Good question.

aurbina

  • Newbie
  • *
  • Posts: 33
Re: HuC questions.
« Reply #18 on: May 26, 2016, 05:24:33 AM »
Quote
Hi Artemio, nice to see you here!  :)

I hadn't looked at your 240p test suite on GitHub and so it hadn't clicked that "aurbina" here is the same Artemio Urbina on GitHub with the HuC fork.

Thanks for your hard work on that.


Yes, I couldn't change my alias to Artemio here after using aurbina 12 years ago...


Quote
Unfortunately, Ulrich's changes use a couple of linux functions that aren't available on Windows, and so it looks like HuC must now be compiled under cygwin and use the nasty cygwin dll on Windows.

I don't know if there's a pre-built version somewhere, perhaps on one of the other forums, or on someone's web page.

Fixing the code that he added to make it compile on Windows again would be a nice little project for a C programmer.


Well, I use the toolchain under Windows, and I believe Ulrich did as well, using MinGW and MSYS: http://www.mingw.org/wiki/msys

I just compiled the toolchain under Windows 7 and everything worked fine, this is the machine I used to develop the Suite.

elmer

  • Hero Member
  • *****
  • Posts: 2153
Re: HuC questions.
« Reply #19 on: May 26, 2016, 06:42:28 AM »
Quote
Well, I use the toolchain under Windows, and I believe Ulrich did as well, using MinGW and MSYS: http://www.mingw.org/wiki/msys

Hmmm ... that's weird!  :-k

I abandoned the original mingw/msys project a few years ago because it was getting so old and out-of-date.

I'm using the mingw-w64/msys2 combination instead which has been an absolute pleasure to work with after my experiences with mingw/msys.

https://sourceforge.net/projects/msys2/

This is the first time that I've heard of the old mingw having a feature that the new mingw-w64 is missing.

In this case, I can't compile Ulrich's HuC source because he's using "fmemopen", which the original HuC project didn't use.

It wouldn't be hard to rewrite the output code to use a different method instead, but I'm not at the point of wanting to do so, yet.
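
One possible "different method", sketched under assumptions: this does not look at how Ulrich's code actually uses fmemopen, it only shows the general shape of swapping a memory stream for an ISO C tmpfile() scratch stream and reading the text back afterwards. open_scratch_stream and slurp_stream are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for fmemopen(): tmpfile() is ISO C and opens in "wb+" mode. */
FILE *open_scratch_stream(void)
{
    return tmpfile();
}

/* Hypothetical helper: read everything written to the scratch stream back
   into a malloc'd, NUL-terminated buffer. Returns NULL on failure. */
char *slurp_stream(FILE *fp, size_t *len_out)
{
    long size;
    char *buf;

    assert(fp != NULL);
    if (fflush(fp) != 0 || fseek(fp, 0L, SEEK_END) != 0)
        return NULL;
    size = ftell(fp);
    if (size < 0 || fseek(fp, 0L, SEEK_SET) != 0)
        return NULL;

    buf = malloc((size_t)size + 1);
    if (buf == NULL)
        return NULL;
    if (fread(buf, 1, (size_t)size, fp) != (size_t)size) {
        free(buf);
        return NULL;
    }
    buf[size] = '\0';
    if (len_out != NULL)
        *len_out = (size_t)size;
    return buf;             /* caller frees the buffer and fcloses fp */
}
```

A caveat worth checking before relying on this: tmpfile() has a reputation for being awkward on some Windows C runtimes (where the temporary file ends up and which rights it needs), so a growable in-memory buffer may turn out to be the safer replacement.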

Gredler

  • Guest
Re: HuC questions.
« Reply #20 on: May 26, 2016, 07:01:06 AM »
Quote
Someone should really gather all the patches together, and release an updated HuC.

I can't speak for DK who's handling 99.999% of the HuC lifting, but yes please! :)

dshadoff

  • Full Member
  • ***
  • Posts: 175
Re: HuC questions.
« Reply #21 on: May 26, 2016, 12:32:03 PM »
Quote
Just out of curiosity, have you thought about adding a separate optimizer to the tool chain?
Something that could read the compiler output, look for things to optimize, and then output a new version of the compiler's output?

Quote
Err ... there's already a simple optimizer built into the HuC code.

It seems to be based upon optimizing the sequence of macros that get used rather than on an individual instruction level.

CC65's optimizer seems to be a more-traditional peephole optimizer at the instruction level.

I'm not sure which is the better approach, but I can certainly see that HuC's approach makes a lot of sense given the simplicity of the code generation, and it's a huge improvement over the original Small-C code that doesn't seem to include any optimization at all.

Have you looked at HuC's source to see about adding some improvements?

Are there particular sequences of macro output that bother you?

No, no... the optimizer in HuC's output does a limited amount of peephole optimization as well.
I spent the better part of a month on it (and cycle-counting the MACROs) in 2001, and got a roughly 100% speed improvement and greater than 10% code size shrink.  However, I prioritized my time to optimize the most common and worst offenders that I encountered.

...But it certainly didn't make up for the 16-bitness which is intrinsic to the compiler.
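
For anyone who hasn't seen one, a peephole pass at the macro level can be sketched like this. It is not HuC's optimizer and the single rule is invented for illustration: a push that is directly followed by a pop cancels out. "__pushw" appears in HuC's output in this thread; "__popw" is an assumed name here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical peephole pass over a list of macro names.
   Copies in[] to out[] and returns the number of lines kept. */
size_t peephole(const char *in[], size_t n_in, const char *out[])
{
    size_t i, n_out = 0;

    assert(in != NULL && out != NULL);
    for (i = 0; i < n_in; i++) {
        /* rule: "__pushw" immediately followed by "__popw" is a no-op */
        if (n_out > 0 &&
            strcmp(out[n_out - 1], "__pushw") == 0 &&
            strcmp(in[i], "__popw") == 0) {
            n_out--;            /* drop the push, skip the pop */
            continue;
        }
        out[n_out++] = in[i];   /* no rule matched: keep the line */
    }
    return n_out;
}
```

Because the rule is checked against the last line already emitted, nested pairs (push, push, pop, pop) collapse too. What a pass like this cannot do is recover type information, which is why it can't fix the 16-bitness.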

If you want to spend some time on a compiler which deals more efficiently with char types, feel free to use whatever you want from the support libraries of HuC - I would certainly support such an effort.

Bear in mind, though, that the complaint I've heard most often over the years is that this is a "K&R" compiler, and not ANSI.  This leads me to believe that the users of the compiler may not be as willing to compromise on feature set as you are... although, in fairness, today's users are a somewhat different group of people than the users of 10 years ago.

Dave
« Last Edit: May 26, 2016, 12:35:57 PM by dshadoff »

elmer

  • Hero Member
  • *****
  • Posts: 2153
Re: HuC questions.
« Reply #22 on: May 26, 2016, 02:00:56 PM »
Quote
No, no... the optimizer in HuC's output does a limited amount of peephole optimization as well.

That's cool! Thanks for correcting my mistake.  :D


Quote
...But it certainly didn't make up for the 16-bitness which is intrinsic to the compiler.

Yep, not much you can do about that at the "optimizer" stage if the compiler has already thrown away that information!


Quote
Bear in mind, though, that the complaint I've heard most often over the years is that this is a "K&R" compiler, and not ANSI.  This leads me to believe that the users of the compiler may not be as willing to compromise on feature set as you are... although, in fairness, today's users are a somewhat different group of people than the users of 10 years ago.

Hahaha, I totally agree with that complaint!  :wink:

I can accept certain limitations ... but I won't accept K&R syntax.

Any coding should be in a semi-modern syntax, even if there are some implementation-gotchas to consider.

I think that Ulrich already ported over the ANSI syntax into his version of HuC, and that's a big step forward, at least to me.

The obvious first improvement to make has got to be in stack access.

That needs to be "__stack,X" and not "(__stack),Y".

CC65 looks (so far) to be a good base to improve from.

The other alternative is to go for an even-smarter compiler.

SDCC is smart-enough to actually examine the call-chain for every function at link time and turn all those stack-based local-variable accesses into absolute locations. That's as fast to process as you can possibly get!
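
The effect can be imitated by hand in C with "static" locals, which is roughly what the SDCC approach buys automatically. A minimal sketch, with made-up function names, assuming the functions are never called recursively or re-entered from an interrupt:

```c
#include <assert.h>
#include <stddef.h>

/* Stack-based locals: on a 6502 compiler with a software stack,
   every access to total and i goes through that stack. */
unsigned sum_stack(const unsigned char *data, unsigned char count)
{
    unsigned total = 0;
    unsigned char i;

    assert(data != NULL || count == 0);
    for (i = 0; i < count; i++)
        total += data[i];
    return total;
}

/* Same loop with the locals at fixed addresses. Not safe for
   recursion or reentrancy, because there is only one copy. */
unsigned sum_static(const unsigned char *data, unsigned char count)
{
    static unsigned total;
    static unsigned char i;

    assert(data != NULL || count == 0);
    total = 0;
    for (i = 0; i < count; i++)
        total += data[i];
    return total;
}
```

CC65 has a static-locals switch that applies the same trick to a whole file (check its documentation for the exact option name and caveats), but that is a blunt tool compared to a compiler that works out from the call graph which locals can safely share fixed locations.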

But (and there's always a "but"), SDCC doesn't support the 6502 at all.

So, if (and it's still "if") I choose to mess with this stuff ... is it easier to hack improvements into CC65, or to add new processor support into SDCC?

There's still lots of research and thinking to do.

Arkhan

  • Hero Member
  • *****
  • Posts: 14142
  • Fuck Elmer.
    • Incessant Negativity Software
Re: HuC questions.
« Reply #23 on: May 26, 2016, 02:13:42 PM »
I was using global variables.  both char and int arrays.   Local variables just make it worse.

I don't have an .S handy at the moment because the code has been all hand optimized now.

but basically just do

int big[5];
int boobies[5];

as globals, and then do if(big[i] > boobies[i]){ big[i] = whatever;}

You'll see what I mean.

elmer

  • Hero Member
  • *****
  • Posts: 2153
Re: HuC questions.
« Reply #24 on: May 28, 2016, 05:57:11 AM »
Quote
A simple
Code:
if(thing[i] > otherthing[i])
comparison is a hot mess.

To me it honestly makes the simplest of C-Like-Things pretty much unusable.

OK, I ran a quick test. Yuk! That generates horrible code!


Quote
It would be nice if the two things didn't use the same index register. 

Couldn't array access just use the Y register?

Look at the code that CC65 generates.

It's also descended from Small-C, and uses that same "(sp),Y" stack that HuC does, but the code generation has been optimized a lot to improve things like those arrays.

I've included an example of what I think CC65's code would look like if I changed the way that its stack worked.

There's still room for optimizations, but it's one heck of a lot better.

CC65's peephole optimizer could easily be extended to remove one of the redundant loads, and the top-of-stack compare code might be improvable within the limits of the compiler.

But I don't think that we could ever get CC65 (or HuC) to produce code like the hand-optimized version that's shown last.

For that, the compiler would need to do a lot of analysis that it just doesn't do.

We'd probably get much closer if we could add 65C02 support to SDCC, but that would be a major project.


**********************************************
Original C Source ("char" is unsigned)
**********************************************

char arr1[8];
char arr2[8];

void foo1 (char index)
{
  if (arr1[index] < arr2[index]) foo2();
}


**********************************************
HuC generated code
**********************************************

       __pushw
       __ldwi   _arr1
       __pushw
       __ldb_s  2
       __addws
       __ldb_p
       __pushw
       __ldwi   _arr2
       __pushw
       __ldb_s  4
       __addws
       __ldb_p
         jsr    lt
       __lbeq   LL3
         call   _foo2
LL3:   __addmi  2,__stack
         rts


**********************************************
CC65 generated code
**********************************************

       jsr  pusha
       lda  (sp)
       tay
       lda  _arr1,y
       jsr  pusha0
       ldy  #$02
       lda  (sp),y
       tay
       lda  _arr2,y
       jsr  tosicmp0
       bcs  L0005
       jsr  _foo2
L0005: jmp  incsp1


**********************************************
Possible from CC65 (with no extra optimization)
**********************************************

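       dex
       sta  __lo_stack+0,x   ; push index
       ldy  __lo_stack+0,x
       lda  _arr1,y
       dex
       sta  __lo_stack+0,x   ; push arr1[index]
       ldy  __lo_stack+1,x
       lda  _arr2,y
       cmp  __lo_stack+0,x   ; A = arr2[index], operand = arr1[index]
       bcc  L0005            ; arr2[index] <  arr1[index]: skip
       beq  L0005            ; arr2[index] == arr1[index]: skip
       jsr  _foo2
L0005: inx                   ; pop arr1[index]
       inx                   ; pop index
       rts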


**********************************************
Hand Optimized (unlikely to easily achieve)
**********************************************

       tay
       lda  _arr1,y
       cmp  _arr2,y
       bcs  L0005
       jmp  _foo2
L0005: rts

.endproc
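
For readers less used to 6502 compares, here is a small C model of the hand-optimized version. It is only a sketch of the flag logic: after "cmp" the carry is set when A >= operand (unsigned), so "bcs" skips the call exactly when arr1[index] >= arr2[index]. carry_after_cmp, foo1_model and the foo2_calls counter are hypothetical names standing in for the real routines.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the 6502 carry flag after "cmp": C = 1 when A >= operand
   (unsigned compare), C = 0 when A < operand. */
int carry_after_cmp(uint8_t a, uint8_t operand)
{
    return a >= operand;
}

uint8_t arr1[8];
uint8_t arr2[8];
int foo2_calls;              /* hypothetical counter standing in for foo2() */

/* C model of: tay / lda _arr1,y / cmp _arr2,y / bcs L0005 / jmp _foo2 */
void foo1_model(uint8_t index)
{
    assert(index < 8);
    if (carry_after_cmp(arr1[index], arr2[index]))
        return;              /* bcs L0005: arr1[index] >= arr2[index] */
    foo2_calls++;            /* jmp _foo2: arr1[index] <  arr2[index] */
}
```

The order of the operands matters: loading arr2 first and comparing against arr1 would need a different branch sequence to express the same "<" test.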
« Last Edit: May 29, 2016, 09:28:09 AM by elmer »

Arkhan

  • Hero Member
  • *****
  • Posts: 14142
  • Fuck Elmer.
    • Incessant Negativity Software
Re: HuC questions.
« Reply #25 on: May 28, 2016, 06:48:58 PM »
yeah I basically just gave up on expecting a C compiler to generate fast enough code.

I use it to quickly get the moving parts behaving like I want (doing game AI in assembly and experimenting with it is a real pain in the ass)...

and then I just #asm#endasm the calls, because as you also demonstrate, you get better code if you know what the hell you're trying to do.

the compiler won't know that and has to be a bit generic.


but ughhh, yeah, that code is a mess from HuC.

Bonknuts

  • Hero Member
  • *****
  • Posts: 3292
Re: HuC questions.
« Reply #26 on: May 29, 2016, 08:58:39 AM »
Does CC65 have pragma fastcalls like HuC? I used it in place of #asm#endasm for HuC. It really makes HuC powerful in the way it integrates with regular C code.
« Last Edit: May 29, 2016, 09:01:17 AM by Bonknuts »

elmer

  • Hero Member
  • *****
  • Posts: 2153
Re: HuC questions.
« Reply #27 on: May 29, 2016, 09:33:43 AM »
Quote
Does CC65 have pragma fastcalls like HuC? I used it in place of #asm#endasm for HuC. It really makes HuC powerful in the way it integrates with regular C code.

I have no idea ... what does "#pragma fastcall" do in HuC?

I'd look it up in the documentation ... but I can't find it.

Bonknuts

  • Hero Member
  • *****
  • Posts: 3292
Re: HuC questions.
« Reply #28 on: May 29, 2016, 10:16:46 AM »
It's a hidden feature :) It basically allows C internal function calling to an ASM routine on the backend lib. You can even do a certain level of argument overloading. But the real advantage of it, is that you get to control how the arguments are passed (ZP, pointers, etc).

 You could do something like if ( array_access(arr1, idx) < array_access(arr2, idx) ) foo();.

 Here's an example from AC.h
Code:
/*
 * ac_vram_xfer( AC reg (word), vram addr (word), num bytes (word), chunk size(byte))
 * ac_vram_xfer( AC reg (word), vram addr (word), num bytes (word), chunk size(byte), const SGX )
 */
#pragma fastcall ac_vram_xfer(byte al, word bx, word cx, byte dl );
#pragma fastcall ac_vram_xfer(byte al, word bx, word cx, byte dl, byte ah );

 Inside my ac_lib.asm file that resides inside library.asm, there is a _ac_vram_xfer.4 and a _ac_vram_xfer.5. Depending on how many arguments are passed to the function, HuC will call the corresponding one (it knows to look for the .x at the end of the label). You can also provide a default with no .x in the label. Obviously in the above, the longer version (_ac_vram_xfer.5) is for SGX video ports.


 You can even do stuff like:
Code:
/*
 * Arcade Card address reg function: 24-bit value, 1 byte(high) and 1 word(mid/low).
 */
#pragma fastcall ac_addr_reg1( byte ac_reg_1_high, word ac_reg_1_low ) nop;
Instead of ZP regs and such, the values get written directly to ports. The nop; at the end tells the compiler not to call a function.

 Anyway, this is how I got around slow pointer/array access in HuC, but in a way that didn't require #asm#endasm. It's really clean and fast, and can be nested inside other C code, etc. I had functions for local data (static mapped ram) and far data. Etc.

 If you look at the HuC C source code for the compiler, you'll see a bunch of internal pragma fastcall definitions/code. I only discovered this, because there were lib functions that weren't in the ASM libraries.
« Last Edit: May 29, 2016, 10:24:22 AM by Bonknuts »

elmer

  • Hero Member
  • *****
  • Posts: 2153
Re: HuC questions.
« Reply #29 on: May 29, 2016, 11:18:50 AM »
Quote
It's a hidden feature :) It basically allows C internal function calling to an ASM routine on the backend lib. You can even do a certain level of argument overloading. But the real advantage of it, is that you get to control how the arguments are passed (ZP, pointers, etc).

I don't think that CC65 has the same control of where parameters are put, but it could probably be added if really needed.

If I'm understanding what you're saying, then you can accomplish basically the same thing in CC65 with the normal C method of doing such things ... create a preprocessor macro.
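
Something along these lines, as a sketch only: ARRAY_LT and first_is_smaller are made-up names, and the arrays are the arr1/arr2 test case from earlier in the thread. Whether the compiler turns this into good code is a separate question, the macro just keeps the call site tidy.

```c
#include <assert.h>

/* Hypothetical macro: compare the same element of two arrays.
   Note that idx is evaluated twice, so don't pass i++ or a call. */
#define ARRAY_LT(a, b, idx)  ((a)[(idx)] < (b)[(idx)])

unsigned char arr1[8];
unsigned char arr2[8];

int first_is_smaller(unsigned char index)
{
    assert(index < 8);
    return ARRAY_LT(arr1, arr2, index);
}
```

Unlike the fastcall approach, a macro gives no say over where the arguments live, it only removes the function call.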

BTW, it looks like Uli's update to HuC finally adds parameters to macros.

CC65 allows you to declare parameters and locals to be "register", and then puts them in a limited area of zero-page.

Just like you're saying in HuC, this is a useful way to speed up pointer access.
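
A minimal sketch of what that looks like in source, assuming nothing beyond standard C. On a desktop compiler "register" is only a hint; the zero-page placement is CC65 behaviour as described above, and it may need to be switched on with a compiler option (check the CC65 documentation). checksum is a made-up example function.

```c
#include <assert.h>
#include <stddef.h>

/* Walk a buffer through a "register" pointer. With zero-page placement
   the pointer can be used directly for (ptr),y style addressing. */
unsigned checksum(const unsigned char *data, unsigned char count)
{
    register const unsigned char *ptr = data;  /* zero-page candidate */
    register unsigned char left = count;       /* zero-page candidate */
    unsigned total = 0;

    assert(data != NULL || count == 0);
    while (left != 0) {
        total += *ptr++;
        left--;
    }
    return total;
}
```

The zero-page area is small, so this only pays off for the few pointers and counters used in the hottest loops.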

I've started changing CC65's code-generation to see if it's going to be easy.

So far, so good.

The way that it's preserving whether operations should be signed or unsigned, char, word or long, is definitely helping the code generation.