Author Topic: Faster fade out code? (Read 3314 times)

Gredler · « **Reply #30 on:** October 06, 2016, 09:56:06 AM »

I feel the need to post something to make DK feel like less of an idiot.

Can't you just store all the palettes to an array, then create a loop that lerps from black to the array color or the array to black?


TheOldMan · « **Reply #31 on:** October 06, 2016, 10:33:34 AM »

Quote

First, why the store zero in $402 and $403?

Select palette slot 0

Quote

Second, if I want to split this into chunks of 64, how do I tell it $404 plus 64 etc

You don't use $404+anything. $404/$405 is the color.
You set $402/$403 to slot number you want. (0, 64, 128. etc)

In general you set the starting slot # in $402/$403, and send the colors to $404/$405.
The slot increments after every write to high byte (I think), so you can just loop through the
palette and pump them out. If you want to break it into 64 color chunks, write the first 64 colors,
then the next 64 colors, etc.

You can set the slot number to start at any slot, even in the middle of a palette, afaik.
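The procedure above can be sketched in plain C with a software stand-in for the VCE. Everything here (`struct vce_sim`, `vce_select_slot`, `vce_write_color`, `send_chunk`) is a hypothetical model for illustration, not HuC or hardware API. It assumes the slot advances after each complete color write, as described above.

```c
#include <stdint.h>

#define VCE_SLOTS 512

/* Hypothetical software model of the VCE color table (not real hardware). */
struct vce_sim {
    uint16_t slot;              /* models the $402/$403 slot register */
    uint16_t color[VCE_SLOTS];  /* models the 512 color entries */
};

/* Models writing the starting slot number to $402/$403. */
static void vce_select_slot(struct vce_sim *v, uint16_t slot)
{
    v->slot = (uint16_t)(slot % VCE_SLOTS);
}

/* Models writing one color to $404/$405; assumes the slot then auto-increments. */
static void vce_write_color(struct vce_sim *v, uint16_t color)
{
    v->color[v->slot] = (uint16_t)(color & 0x1FF);   /* colors are 9-bit */
    v->slot = (uint16_t)((v->slot + 1) % VCE_SLOTS);
}

/* Sends `count` colors starting at `first_slot`: select once, then pump. */
static void send_chunk(struct vce_sim *v, uint16_t first_slot,
                       const uint16_t *colors, int count)
{
    int n;
    vce_select_slot(v, first_slot);
    for (n = 0; n < count; n++)
        vce_write_color(v, colors[n]);
}
```

Calling `send_chunk(&v, 64, chunk, 64)` in this model fills slots 64 to 127 and leaves the slot register at 128, ready for the next chunk without another select.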

Quote

Can't you just store all the palettes to an array, then create a loop that lerps from black to the array color or the array to black?

Yes. But that won't update the palettes until you send it to the vce.
If you really want to, you can read the color from the vce, fade it, and write it back, without any intermediate array. It's just quicker to use an array. (Less overhead)
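A minimal sketch of the array approach, assuming 9-bit colors made of three 3-bit fields (which matches the 7/56/448 masks used elsewhere in this thread). The function names are made up for illustration. The faded array still has to be sent to the VCE afterwards.

```c
#include <stdint.h>

/* Scale one 3-bit field of a 9-bit color by level/7 (level 0 = black). */
static uint16_t scale_field(uint16_t color, int shift, int level)
{
    int value = (color >> shift) & 7;
    return (uint16_t)(((value * level) / 7) << shift);
}

/* Linear interpolation between black (level 0) and the stored color (level 7). */
static uint16_t lerp_from_black(uint16_t color, int level)
{
    return (uint16_t)(scale_field(color, 0, level) |
                      scale_field(color, 3, level) |
                      scale_field(color, 6, level));
}

/* Fill `out` with the faded version of `base`; `out` is what would be
   transferred to the VCE. */
static void build_faded_palette(const uint16_t *base, uint16_t *out,
                                int count, int level)
{
    int n;
    for (n = 0; n < count; n++)
        out[n] = lerp_from_black(base[n], level);
}
```

Stepping `level` from 7 down to 0 fades out, and from 0 up to 7 fades in, always from the untouched `base` array.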

Bonknuts · « **Reply #32 on:** October 06, 2016, 04:23:04 PM »

Quote from: DarkKobold on October 06, 2016, 09:46:54 AM

Quote from: touko on September 20, 2015, 06:05:36 AM
I think the best way to optimise your routine is to do the fade in a buffer for all palettes, then transfer the whole buffer after a vsync, in asm, with a tia block transfer.

like that:

/* A 256 bytes buffer is enough for 8 palettes */
int my_buffer[1024];

#asm

stz $402
stz $403

tia _my_buffer , $404 , 1023

#endasm

My fade routine is close to yours, and is very fast, but in ASM.

So, I'm really confused.

First, why the store zero in $402 and $403?

Second, if I want to split this into chunks of 64, how do I tell it $404 plus 64 etc?

This is why I hate assembly. I know, I'm definitely the idiot of the thread.

It's not assembly, but direct hardware interfacing. You could access the ports directly in C.

$402/$403 make up a single port to the VCE color # or slot. Because there are 512 color slots in the VCE, the slot number is too large for a single 8-bit port to hold. So the 16-bit port is spread across two 8-bit ports; 0x402 is the LSByte and 0x403 is the MSByte.

VCE and VDC ports tend to have what is known as a "latch" system. This means that when you write the upper byte of a 16-bit port, it triggers the transfer of the contents of the two ports to the internal place they need to go (be it VCE or VDC).

In the case of the VCE, 0x402 is the LSB, and 0x403 is the MSB and latch. Once 0x403 is written to, the contents are transferred to whatever reg is internal to the VCE. But not until that latch port is accessed - so the order of port pair access is very important.

On the VCE, here are some ports:
0x402/0x403 = is the color slot you want to update.
0x404/0x405 = the color value to update on the corresponding color slot.

One other thing to note: While you can constantly tell the VCE what specific color you want to update, it does have an "auto increment" internal mechanism that automatically advances to the next color slot after a successful write/update (i.e. latch port). Same with reading color data from the VCE.
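The latch behavior can be modeled in a few lines of portable C. This is only a sketch of the description above: `struct latched_port` and its two functions are hypothetical names, and real code would write the hardware ports rather than a struct.

```c
#include <stdint.h>

/* Hypothetical model of one 16-bit latched port pair such as $402/$403. */
struct latched_port {
    uint8_t  pending_lo;  /* last byte written to the low port */
    uint16_t value;       /* internal register, updated only on latch */
};

/* Low-byte write: stored, but the internal register does not change yet. */
static void port_write_lo(struct latched_port *p, uint8_t lo)
{
    p->pending_lo = lo;
}

/* High-byte write: acts as the latch and commits both bytes together. */
static void port_write_hi(struct latched_port *p, uint8_t hi)
{
    p->value = (uint16_t)(((uint16_t)hi << 8) | p->pending_lo);
}
```

In this model, writing the high byte first commits whatever stale low byte was pending, which is why the low byte has to be written before the high byte.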

DarkKobold · « **Reply #33 on:** November 01, 2016, 05:55:13 PM »

int my_buffer[64];

fade_out()
{

   int i, clr;
   char j,k;
   for (j = 0; j <8; j++)
   {

      #asm

          stz $402
          stz $403
       #endasm
      for (i=0;i<512;i+=64)
      {
         for (k=0; k<64; k++)
         {
            clr = get_color(i+k);
            if (clr&7) clr = clr - 1;
            if (clr&56) clr = clr - 8;
            if (clr&448) clr = clr - 64;
            my_buffer[k] = clr;
         }
          vsync();
       #asm
          tia _my_buffer , $404 , 64
       #endasm
       }
   }
   cls();
   reset_satb();
   satb_update();
   vsync();
}

So, here's my attempt to split it into 64-color chunks. It... fails. Miserably. I'd assume it's latching fine, and should store the current state of the latches through each vsync.

TheOldMan · « **Reply #34 on:** November 01, 2016, 06:17:33 PM »

Quote

int my_buffer[64];
.
.
.
tia _my_buffer , $404 , 64

Ints are 2 bytes. You're only transferring 32 'colors'. Try using 128 as length

Also, be careful mixing ints and chars. HuC doesn't promote chars to ints.
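The length arithmetic, as a small sketch that also compiles on a PC. `huc_int` is a hypothetical typedef standing in for HuC's 2-byte int, and the helper names are made up for illustration.

```c
#include <stdint.h>
#include <stddef.h>

#define CHUNK_COLORS 64

/* int16_t stands in for HuC's 2-byte int so this also compiles on a PC. */
typedef int16_t huc_int;

static huc_int my_buffer[CHUNK_COLORS];

/* Length operand for the block transfer, in bytes rather than colors. */
static size_t chunk_bytes(void)
{
    return sizeof(my_buffer);   /* 64 colors * 2 bytes = 128 */
}

/* How many whole colors a transfer of `bytes` bytes actually covers. */
static size_t colors_in(size_t bytes)
{
    return bytes / sizeof(huc_int);
}
```

Here `chunk_bytes()` gives 128, while `colors_in(64)` gives 32, which is the shortfall described above.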

touko · « **Reply #35 on:** November 01, 2016, 10:11:18 PM »

And be careful: there is only one pointer into the VCE's hardware color table.
You must select the color entry for writing AND for reading (unless you want to read one palette and write the next one), like this:

#asm
; // Probably not needed here, because get_color() selects the palette entry every time.
stz $402
stz $403
#endasm
for (i=0;i<512;i+=64)
{
for (k=0; k<64; k++)
{
clr = get_color(i+k);
if (clr&7) clr = clr - 1;
if (clr&56) clr = clr - 8;
if (clr&448) clr = clr - 64;
my_buffer[k] = clr;
}
vsync();
#asm
stz $402
stz $403
tia _my_buffer , $404 , 64
#endasm

Otherwise you read from palette 0 and write into palette 2, as you did.
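A sketch of why the reselect matters, using a software model in which reads and writes share one auto-incrementing pointer, as described above. `struct color_table` and the functions are hypothetical, and the exact hardware behavior is taken from this thread rather than verified here.

```c
#include <stdint.h>

#define SLOTS 512

/* Hypothetical model: one slot pointer shared by reads and writes. */
struct color_table {
    uint16_t slot;
    uint16_t color[SLOTS];
};

static void select_slot(struct color_table *t, uint16_t slot)
{
    t->slot = (uint16_t)(slot % SLOTS);
}

static uint16_t read_color(struct color_table *t)
{
    uint16_t c = t->color[t->slot];
    t->slot = (uint16_t)((t->slot + 1) % SLOTS);   /* read advances the pointer */
    return c;
}

static void write_color(struct color_table *t, uint16_t c)
{
    t->color[t->slot] = c;
    t->slot = (uint16_t)((t->slot + 1) % SLOTS);   /* write advances it too */
}

/* Reads `count` colors from `first`, then writes `marker` `count` times.
   With reselect == 0 the writes land `count` slots past `first`. */
static void read_then_write(struct color_table *t, uint16_t first,
                            int count, int reselect, uint16_t marker)
{
    int n;
    select_slot(t, first);
    for (n = 0; n < count; n++)
        (void)read_color(t);          /* reading moves the shared pointer */
    if (reselect)
        select_slot(t, first);        /* point back at the chunk just read */
    for (n = 0; n < count; n++)
        write_color(t, marker);
}
```

Without the reselect, the 64 writes in this model start at slot 64 instead of slot 0, so the chunk that was read is never the chunk that gets updated.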

DarkKobold · « **Reply #36 on:** November 02, 2016, 01:20:20 PM »

for posterity, here is the final function. I will be testing it on hardware shortly. Whoever sees this in the near or far future, feel free to use it.

fade_out()
{

   int i, clr;
   char j,k;
   for (j = 0; j <8; j++)
   {
      for (i=0;i<512;i+=64)
      {
         for (k=0; k<64; k++)
         {
            clr = get_color(i+k);
            if (clr&7) clr = clr - 1;
            if (clr&56) clr = clr - 8;
            if (clr&448) clr = clr - 64;
            my_buffer[k] = clr;
         }
          vsync();
          clr = get_color(i-1);
       #asm
          tia _my_buffer , $404 , 128
       #endasm
       }
   }
   cls();
   reset_satb();
   satb_update();
   vsync();
}

EDIT: Oh, and a huge thanks to everyone for working with me. I'm glad that a simple fade didn't kill catastrophy.

Gredler · « **Reply #37 on:** November 02, 2016, 01:34:20 PM »

Quote from: DarkKobold on November 02, 2016, 01:20:20 PM

for posterity, here is the final function. I will be testing it on hardware shortly. Whoever sees this in the near or far future, feel free to use it.

fade_out()
{

   int i, clr;
   char j,k;
   for (j = 0; j <8; j++)
   {
      for (i=0;i<512;i+=64)
      {
         for (k=0; k<64; k++)
         {
            clr = get_color(i+k);
            if (clr&7) clr = clr - 1;
            if (clr&56) clr = clr - 8;
            if (clr&448) clr = clr - 64;
            my_buffer[k] = clr;
         }
          vsync();
          clr = get_color(i-1);
       #asm
          tia _my_buffer , $404 , 128
       #endasm
       }
   }
   cls();
   reset_satb();
   satb_update();
   vsync();
}

EDIT: Oh, and a huge thanks to everyone for working with me. I'm glad that a simple fade didn't kill catastrophy.

That would have been catastrophic

touko · « **Reply #38 on:** November 02, 2016, 09:18:20 PM »

If you want your routine to be faster, you can translate this into assembly:

Quote

for (j = 0; j <8; j++)
{
for (i=0;i<512;i+=64)
{
for (k=0; k<64; k++)
{
clr = get_color(i+k);
if (clr&7) clr = clr - 1;
if (clr&56) clr = clr - 8;
if (clr&448) clr = clr - 64;
my_buffer[k] = clr;
}

This loop is slow as hell .
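For reference, the per-color work inside that loop can be isolated like this. It is the same three tests from the quoted code, written with hex masks (7, 56 and 448 are 0x007, 0x038 and 0x1C0); the function names are made up for illustration, and this is plain C, not the assembly version suggested above.

```c
#include <stdint.h>

/* One fade step: decrement each non-zero 3-bit field of a 9-bit color. */
static uint16_t fade_step(uint16_t clr)
{
    if (clr & 0x007) clr = (uint16_t)(clr - 0x001);   /* bits 0-2 (mask 7)   */
    if (clr & 0x038) clr = (uint16_t)(clr - 0x008);   /* bits 3-5 (mask 56)  */
    if (clr & 0x1C0) clr = (uint16_t)(clr - 0x040);   /* bits 6-8 (mask 448) */
    return clr;
}

/* Number of steps until the color is black. A 3-bit field tops out at 7,
   so this never exceeds 7. */
static int steps_to_black(uint16_t clr)
{
    int steps = 0;
    clr = (uint16_t)(clr & 0x1FF);    /* ignore anything above the 9 color bits */
    while (clr != 0) {
        clr = fade_step(clr);
        steps++;
    }
    return steps;
}
```

Since no field can exceed 7, seven passes are enough to reach black; the eighth pass of the `j` loop leaves an already-black palette unchanged.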

Bonknuts · « **Reply #39 on:** November 03, 2016, 07:07:21 AM »

Quote from: DarkKobold on November 02, 2016, 01:20:20 PM

for posterity, here is the final function. I will be testing it on hardware shortly. Whoever sees this in the near or far future, feel free to use it.

fade_out()
{

   int i, clr;
   char j,k;
   for (j = 0; j <8; j++)
   {
      for (i=0;i<512;i+=64)
      {
         for (k=0; k<64; k++)
         {
            clr = get_color(i+k);
            if (clr&7) clr = clr - 1;
            if (clr&56) clr = clr - 8;
            if (clr&448) clr = clr - 64;
            my_buffer[k] = clr;
         }
          vsync();
          clr = get_color(i-1);
       #asm
          tia _my_buffer , $404 , 128
       #endasm
       }
   }
   cls();
   reset_satb();
   satb_update();
   vsync();
}

EDIT: Oh, and a huge thanks to everyone for working with me. I'm glad that a simple fade didn't kill catastrophy.

It's going to cause snow on the real system if this takes too long (goes into active display), which I think it will.

Use the Txx instruction and read the color data directly into your my_buffer during vblank. Then do your modifications on the array - when each iteration has completed (one iteration of j), wait for vsync and Txx the buffer back to the color ram port. That should prevent any snow on screen.

If you slightly modify your code like this:

Quote

fade_out()
{

int i, clr;
char j,k;

vsync();

for (j = 0; j <8; j++)
{
for (i=0;i<512;i+=64)
{

temp = i;   /* copy the local into a global so the asm block can see it */
#asm
lda _temp
sta $402
lda _temp+1
sta $403
tai $404, _my_buffer, 128
#endasm

for (k=0; k<64; k++)
{
clr = my_buffer[k];
if (clr&7) clr = clr - 1;
if (clr&56) clr = clr - 8;
if (clr&448) clr = clr - 64;
my_buffer[k] = clr;
}
vsync();
#asm
lda _temp
sta $402
lda _temp+1
sta $403
tia _my_buffer , $404 , 128
#endasm
}
}
cls();
reset_satb();
satb_update();
vsync();
}

It should get everything done during vblank and avoid snow on screen on the real system. That also includes reading in the next 64 colors (both transfers together only take about 1.5k CPU cycles). Note: you'll need a global variable "temp", or some such name, so that you can pass the function's local value to the asm block. Also, I think the read port is $404 and not $406. If not, then change it to $406.
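The 1.5k figure can be checked with a little arithmetic. The cost model below (17 cycles of overhead plus 6 cycles per byte for a Txx block transfer) is an assumption taken from commonly cited HuC6280 timing tables, not something stated in this thread, and the function names are hypothetical.

```c
/* Assumed cost model for a HuC6280 block transfer: 17 + 6 cycles per byte. */
static long block_transfer_cycles(long bytes)
{
    return 17L + 6L * bytes;
}

/* One read (tai) plus one write (tia) of the same size per chunk. */
static long chunk_round_trip_cycles(long bytes)
{
    return 2L * block_transfer_cycles(bytes);
}
```

Under that assumption, a 128-byte transfer costs 785 cycles and the read plus write pair costs 1570, which lines up with the rough 1.5k estimate.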

DarkKobold · « **Reply #40 on:** November 03, 2016, 08:19:06 AM »

Quote from: Bonknuts on November 03, 2016, 07:07:21 AM

It's going to cause snow on the real system if this takes too long (goes into active display), which I think it will.

Use the Txx instruction and read the color data directly into your my_buffer during vblank. Then do your modifications on the array - when each iteration has completed (one iteration of j), wait for vsync and Txx the buffer back to the color ram port. That should prevent any snow on screen.

So, that is why I put the vsync right before the transfer - even if the code takes two frames, the transfer will still only occur right after a vblank. I'm actually concerned that if I make the code faster, I'll have to put in delays, as the fade is already pretty fast now. It's a minimum of 64 frames already.

Granted, I need to try my new code in hardware. I'll also try yours.

As a question - I keep needing to add globals to do ASM. Could a future version of HuC do locals?

TheOldMan · « **Reply #41 on:** November 03, 2016, 08:44:57 AM »

Quote

I'm actually concerned that if I make the code faster, I'll have to put in delays, as the fade is already pretty fast now.

Suck up the need to add a delay. Use a down-counter so it's tunable. It's not too bad if you use a dedicated fade routine. You can use the wait time to do other things, like loading new gfx....
Just my opinion.
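A rough sketch of a tunable down-counter delay. `wait_frame` is a hypothetical stand-in that only counts frames so the example runs anywhere; in HuC it would be a call to vsync(), and the fade and upload work is left as a comment.

```c
/* Hypothetical stand-in for vsync(): just counts how many frames were waited. */
static int frames_waited = 0;

static void wait_frame(void)
{
    frames_waited++;
}

/* Tunable delay: waits `frames_per_step` frames using a down-counter. */
static void fade_delay(int frames_per_step)
{
    int counter = frames_per_step;
    while (counter > 0) {
        wait_frame();       /* other per-frame work could go here */
        counter--;
    }
}

/* Runs `steps` fade steps, pausing `frames_per_step` frames after each.
   Returns the total number of frames waited. */
static int run_fade(int steps, int frames_per_step)
{
    int s;
    frames_waited = 0;
    for (s = 0; s < steps; s++) {
        /* a real routine would fade and upload the palette here */
        fade_delay(frames_per_step);
    }
    return frames_waited;
}
```

With this shape, the fade speed is set by one number: `run_fade(8, 4)` waits 32 frames in total, and changing the second argument retunes it without touching the fade code.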

Quote

As a question - I keep needing to add globals to do ASM. Could a future version of HuC do locals?

You can do that already. It's just a pain, as they have to be accessed via the HuC stack pointer, which is slow...

Bonknuts · « **Reply #42 on:** November 03, 2016, 09:17:41 AM »

Quote from: DarkKobold on November 03, 2016, 08:19:06 AM

Quote from: Bonknuts on November 03, 2016, 07:07:21 AM

It's going to cause snow on the real system if this takes too long (goes into active display), which I think it will.

Use the Txx instruction and read the color data directly into your my_buffer during vblank. Then do your modifications on the array - when each iteration has completed (one iteration of j), wait for vsync and Txx the buffer back to the color ram port. That should prevent any snow on screen.

So, that is why I put the vsync right before the transfer - even if the code takes two frames, the transfer will still only occur right after a vblank. I'm actually concerned that if I make the code faster, I'll have to put in delays, as the fade is already pretty fast now. It's a minimum of 64 frames already.

Hmm... there's an issue you might not be aware of: every time you read or write any VCE reg (that includes $400 and $401), you stop the VCE from reading the pixel bus that the VDC is constantly outputting to. Since it can't read from the pixel bus, it outputs the last color (pixel) that it did read. You get horizontal 'stretches' of color across the screen, i.e. snow. Not just at the borders, but anywhere on a scanline. This actually happens when you turn the display "off" too, but since the whole screen is one color, you can't see the pixel "stretching". This is different from color update interference on other systems, where if you update a color while the display is active, you see that color update as corruption on screen. The VCE doesn't do this, but reading or writing any VCE port gives the same stretching behavior regardless (read or write, color update regs or other VCE regs).

Here's an example video where I purposely do it:

So any access to the VCE does this, not just reading or writing. If your routine does manage to read in and modify all 64 colors within vblank, and you update on the following frame, then you'll be fine. And if that's the case, then don't worry about the code changes I made (unless you want more resources during vblank to do something else, but it doesn't look like it. You'd have to make a completely different system/function for that).

Quote

Granted, I need to try my new code in hardware. I'll also try yours.

Test yours first, and if it's good then don't worry about mine. Just keep my code in mind, i.e. the approach I took, as you might want to do more flexible fade routines in the future.

Quote

As a question - I keep needing to add globals to do ASM. Could a future version of HuC do locals?

What TheOldMan said. It's a pain, because you have to generate the .s file, look at what index represents that variable, then go back and write an indirect-index load from it. It's not just that the instance variable inside the function is a stack object, but there's no clean way to access it in asm without knowing what the index is on that stack for that specific instance variable. Indeed, it would be nice for HuC to pass this on to asm block. If the assembler had a way to make scope equates, then HuC could generate a function scope equate list for each function (the index into the stack). Globals are just easier to transfer stuff to.

touko · « **Reply #43 on:** November 03, 2016, 10:24:58 PM »

Quote

As a question - I keep needing to add globals to do ASM. Could a future version of HuC do locals?

Actually as declared:
int i, clr;
char j,k;

are treated as local variables. If you want locals in asm, you must use the stack ($100 -> $106 should be safe enough), or you can use the classic push/pop.
But in fact you already have a bunch of temporary global variables reserved (like __temp, <_al, <_bl, etc.).

Arkhan · « **Reply #44 on:** November 14, 2016, 01:49:00 PM »

Quote from: TheOldMan on November 03, 2016, 08:44:57 AM

Quote
I'm actually concerned that if I make the code faster, I'll have to put in delays, as the fade is already pretty fast now.
Suck up the need to add a delay. Use a down-counter so it's tunable. It's not too bad if you use a dedicated fade routine. You can use the wait time to do other things, like loading new gfx....
Just my opinion.

Quote
As a question - I keep needing to add globals to do ASM. Could a future version of HuC do locals?

You can do that already. It's just a pain, as they have to be accessed via the HuC stack pointer, which is slow...

The entirety of Atlantean is written with global variables.

Just saying.

lol

ASM doesn't have a concept of local, really. you push/pop things to make them "local", simply by saving the state of all of the registers so you can f*ck around with them again before popping the stack back to reset everything, but yeah

global = <3

you'll get faster code.
