Without context this might seem a little confusing, but..
st2 #$xx            ; 8 writes to the VDC data port (one short block)
st2 #$xx
st2 #$xx
st2 #$xx
st2 #$xx
st2 #$xx
st2 #$xx
st2 #$xx
bbr0 zp0,.skip0     ; bit 0 of zp0 clear -> skip the RTS and keep writing
rts                 ; bit 0 set -> stop the run here
.skip0
st2 #$xx            ; next block of 8 writes
st2 #$xx
st2 #$xx
st2 #$xx
st2 #$xx
st2 #$xx
st2 #$xx
st2 #$xx
bbr1 zp0,.skip1     ; bit 1 of zp0 clear -> keep going
rts                 ; bit 1 set -> stop after two blocks
.skip1
(Doesn't have to be all ST2 opcodes; it can be ST1/ST2 pairs as well.)
I.e. you can break up long runs of pixel writes into short blocks, and control the length (in a coarse amount) with a bitmask in a series of ZP variables. In this example, I'm writing lines instead of columns because I have the VDC write incrementor set just right.
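As a minimal sketch of how that mask gets used (the write_run label is made up; zp0 is just the zero-page mask byte the BBRs above test, and the bit you set is where the run stops):

lda #%00000010      ; bit 1 set -> BBR1 falls through to its RTS
sta zp0             ; so only the first two 8-write blocks get written
jsr write_run       ; the unrolled ST2 run shown above
; storing #0 instead lets every BBRn branch past its RTS, writing the full run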
My transparency demo (that uses the TF4 BG) could really benefit from this. You can do dynamic tiles as columns, or as single bitmap lines (the VDC allows either write method). Each has its advantages and disadvantages. Column writing allows easy re-positioning to make a large area scroll horizontally with only the pre-shifted frames - but it's more complicated if you do vertical scrolling (shifting). Line mode allows vertical scrolling, as well as hsync sine wave effects, vertical scaling effects, and easy vertical mirroring - but is more difficult to scroll horizontally.
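For reference (assuming the usual documented layout of the VDC registers), the write increment I mean is the IW field of control register $05:

; VDC reg $05 (CR), bits 11-12 = IW: VRAM address auto-increment per data write
;   %00 -> +$01     %01 -> +$20     %10 -> +$40     %11 -> +$80
; +$01 walks VRAM word by word, the bigger steps jump a map-width at a time,
; which is what makes either line-style or column-style streaming practical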
All this is in relation to really large "brick style" dynamic blocks. Stuff half the size of the screen, or possibly larger than the screen itself (I have such a demo effect that uses this, it just needs a real demo to be part of).
The TF4 transparency demo for PCE, if anyone has seen it, basically leaves the first two planes of a PCE tile (p0,p1) for tile data. That's 4 colors, but more if you use subpalettes (3*16 + 1 = 49 colors to be exact). The second half of the composite 4-bit tile is planes 2/3, which the CPU writes the large dynamic tileset data to. The cloud layer is made up of three colors, and the tiles are 4 colors. Each color of the cloud layer corresponds to a set of 4 hue-tinted colors in the current subpalette, with color #0 on the cloud layer showing the normal colors of the tile underneath it. Like I said, you can use different subpalettes for any of the tiles, as long as they all have a cloud hue-tinted set in them (which can be whatever, and different from tile to tile).
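To make that split concrete, here's the layout being leaned on (standard PCE 8x8 BG tile, 16 words per tile):

; PCE 8x8 BG tile = 16 words (32 bytes) of VRAM:
;   words 0-7  : low byte = plane 0, high byte = plane 1  (the 4-color art)
;   words 8-15 : low byte = plane 2, high byte = plane 3  (the cloud overlay)
; the dynamic cloud data only ever touches words 8-15 of each tile; a cloud
; value of %00 leaves the art pixel on colors 0-3, while %01/%10/%11 push it
; into one of the three hue-tinted groups of 4 in the same subpalette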
Two issues with this approach for the demo. First: the background "area" that's affected by the transparency part needs to be an actual bitmap buffer. This is easily done with tiles; you just stream the right edge of the screen (off screen) with a single column of tiles when needed. Not a big deal and barely any cpu resource to do this (the nice thing is you can easily add tile flipping support this way, which the PCE normally doesn't support). In the TF4 pce TP demo, only the area where the cloud layer is needs to be this bitmap thingy. The rest of the tilemap can be regular tiles, meaning the buffer doesn't need to be that large if you don't need it to be.
The second issue: the most efficient way to write the cloud layer. If you've seen the demo, you'll notice at some point that when the map keeps scrolling, the transparency overlay gets stuck. That's because the demo was never finished. But it's also because the demo doesn't handle "wrap around". So what you're seeing is a linear stretch, and then something it can't handle (wrap around). To handle wrap around, you need to be able to write with specific start and stop points. In the TF4 demo, it does "line" mode. This allows it to write 1/8 of the whole image in one long st1/st2 opcode output to the VDC. To put that into perspective: 256x112 (I think that's the height of the cloud layer) would take 256x112 @ 2bit = 7,168 bytes to write to vram. At 5 cycles a byte, that's 35,840 cpu cycles or 30% cpu resource. In actuality though, since there are gaps in the cloud layer, those can be stored as ST2 opcodes - saving one write per 8x1 blank area. For the sake of example, let's say that's 15% blank space. That brings the cpu write sequence down to ~25% cpu resource. Now notice that the cloud layer is at the bottom of the screen? 7,168 * 5 cycles = 35,840 cycles / 455 cycles (one scanline) = 79 scanlines. This means I can actually do this during the top of active display; I have enough time to race the display, leaving vblank free and leaving the rest of active display free (I'll just assume active display is 224 scanlines tall). Another method, if the cloud layer was at the top of the screen instead of the bottom, would be to "trail" the display so the changes being made on that frame don't show as you're writing - blah blah blah.
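Rough budget, just to lay those same numbers out in one place:

; cloud layer  : 256 x 112 px @ 2bpp               =  7,168 bytes
; write cost   : 7,168 bytes x 5 cycles (ST1/ST2)  = 35,840 cycles
; in scanlines : 35,840 / 455 cycles per line      = ~79 scanlines
; per frame    : 35,840 of ~119,000 cycles (455 x ~262 lines) = ~30% cpu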
Here's the video.. (touko uploaded it)
Draft: I have a little more to write, so I'll either update this post or just post some more..
Ok, so the second issue isn't cpu resource (at least not yet) but getting the dynamic offset image to the screen image buffer, and having it wrap around. One easy way to do this is to store the image as column data, and after you cycle through all 8 frames of the shifted image, offset the column by +1. Of course, mod (%) by the length for wrap around. The concept is simple. But herein lies the problem: the composite tile format. While it helps us by letting the VDC do the transparency work for us (this is how plane format facilitates transparency effects - a crude way), the composite format is now a hindrance. For column writes, you can only write to one set of plane pairs, which means only 16 bytes can be written before you have to reposition the vram pointer. That's going to take somewhere between 28-34 cycles *if* you embed it into the graphic data itself (using A:X to hold the vram offset, or X as an index into a table). That adds another 15k cycles on top of the 35k (unoptimized version; no gap optimization). Another approach is to write only one line of pixels per tile; write 1/8 of each composite tile. If the height of the image block to write to the screen is 112 pixels, that's 14 lines at 2 bytes each.. so 28 bytes written before a vram pointer re-position is needed. 7,168 / 28 * 34 cycles = 8,704 cycles. Not bad. Almost cut the overhead in half.
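For reference, a minimal sketch of what one of those embedded repositions could look like (the dest_lo/dest_hi tables are made up for illustration; assumes the hardware bank is mapped in the usual spot so the VDC ports sit at $0000-$0003, and X indexes the destination table):

st0 #$00            ; select MAWR (vram write address)
lda dest_lo,x       ; low byte of the next destination
sta $0002           ; VDC data port, low
lda dest_hi,x       ; high byte
sta $0003           ; VDC data port, high
st0 #$02            ; back to VWR, ready for the next ST1/ST2 run
inx                 ; step to the next table entry
; lands in the ballpark of the 28-34 cycles quoted above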
If we know the block of data is wider than it is tall, we could optimize for horizontal writes - but this introduces other problems. With column writing, it's easy to offset every 8 pixels when needed. Line writing doesn't allow that. If you break line writing up into smaller segments, like the original code shown above, you can output a string of data to the buffer at a smaller coarse length. You can even jump into the middle of a string of data (opcodes), or anywhere from start to finish. If you think about this as a left side vs right side problem, dealing with wrap around when there isn't alignment, the right side is going to be the problem. The left side can be dealt with by jumping into the middle, or whatever offset, of the stream length (above it's 8 pixel writes, in segments of 8 pixels - and the shift frame takes care of the intra-pixel offset inside that 8-pixel segment).
But how do you deal with the right side? One way is to handle an overspill area. This is an offscreen area that lets the extra data have no effect on the display. The downside is that the buffer is now a little bit wider. Not doing an overspill area means having to manually write out the remaining bytes (pick your poison). Both methods are more complex than the column mode, and both methods require a good-sized jmp/jsr table for offsets. They also require the data "sets" (shift sets) to be bank aligned so that one routine works for all data shift sets. And to top it off, you still have to reposition the vram pointer once per "line" write (if you start from the left first, this is only once per line even on a wrap around point). What's the advantage in cycles? Well, the current cloud layer is something like 256px wide. So that's 32 paired writes (32 8x1 @2bit line segments) for 64 bytes. The code for 8x1 cells has an overhead of 8 cycles (the BBR instruction), so that averages out to 5.5 cycles a byte written in a block of 8 segments (16 bytes). And only one 26-34 cycle overhead for the vram reposition. But you'll also have overhead from the spill area write, each line, to take into account.
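As a rough sketch of the kind of jump table I mean (all labels hypothetical; one entry per 8x1 segment of the unrolled line, with X holding the starting segment * 2):

line_entry: .dw seg00, seg01, seg02, seg03   ; ... 32 entries for a 256px line

start_line:
    jmp [line_entry,x]   ; enter the unrolled stream at any 8x1 boundary

seg00: st1 #$xx          ; low byte of the 8x1 cell (plane 2)
       st2 #$xx          ; high byte (plane 3) - triggers the vram write
seg01: st1 #$xx
       st2 #$xx
       ; ... continues through the right edge / spill area, then rts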
In the end, the line method will be super convoluted and might only be slightly faster than the column mode version, and generating those tables for the offsets is going to be a huge pain in the ass - but all said and done, the line method would allow you to do sine wave effects both horizontally and vertically like with a normal PCE map/bg layer, on top of having vertical scrolling ability too (animate the layer scrolling up from the bottom). The column method is easier, but can't do anything like what the line method can do. So like I said, super convoluted, but it's also a one-and-done type of deal. Once working, it'll be a really powerful effect for the PCE.
As far as those jump tables are concerned, I'd most surely write a PC app to generate that code. No amount of macros in PCEAS is going to make that an easy job.
Is this extreme? You bet. But is this doable? Completely. And from a cpu resource perspective, incredibly doable. It might not be representative of what any dev would do back in the day, but that's not what this is about. This is about pushing the system to its limits - to see what it can do.
Just to note: the cloud layer does not have to remain static. It can scroll at its own speed in either direction (right or left). Both methods work, and both allow the cloud layer to scroll left or right, but the line method allows additional effects to be applied to that layer.
Also note the transition line right above the cloud layer - those are no longer 2bit tiles. They're 4bit tiles, allowing the mountain range to use 4 colors total, and still have a 5th one as well as any static cloud pixel data (more colors). So no, the whole screen doesn't need to be made up of 2bit colored tiles. But even for the areas that are, you still have subpalettes to break up the color usage, and the transparency layer will still apply to those subpalettes.