Author Topic: CC65 and the PCE  (Read 5720 times)

touko

Re: CC65 and the PCE
« Reply #120 on: March 10, 2016, 09:47:41 PM »
EDIT: Sorry, I was wrong; the VRAM read/write addresses are fine  :P
You only shift the tile/sprite addresses, not the VRAM ones.

I tried to load your font with pceas and it works fine.

Try this for loading your font:
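      lda #.bank( gfx4BPP_font )        ; bank number of the font data
      tam #2                            ; map it into MPR2
      inc A
      tam #3                            ; map the following bank into MPR3

      st0 #0                            ; select VDC register 0 (MAWR, VRAM write address)
      st1 #<( $800 )                    ; VRAM address, low byte
      st2 #>( $800 )                    ; VRAM address, high byte

      st0 #2                            ; select VDC register 2 (VRAM data)
      tia gfx4BPP_font , $0002 , $800   ; copy $800 bytes to the VDC data port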
« Last Edit: March 10, 2016, 10:35:47 PM by touko »

elmer

« Reply #121 on: March 11, 2016, 04:13:27 AM »
Your vram_clearBAT code isn't correct ... you're using Y without initializing it.

Your loop should look more like ...


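                lda     #$80                ; low byte written to every entry
                ldx     #32                 ; outer counter: 32 rows
@rowLoop:       ldy     #32                 ; inner counter, re-initialized for each row
@colLoop:       sta     a:VDC_DATA_LO       ; latch the low byte
                stz     a:VDC_DATA_HI       ; high byte = $00, commits the word
                dey
                bne     @colLoop
                dex
                bne     @rowLoop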



Your font loading code isn't correct ... you're using Y without initializing it (because font_loadSize is $0800).

                ; load loop goes here
                ldx     font_loadSize
                beq     @checkOuter
                cly
@fontLoadLoop:  lda     (fontDataAddr),y



Fix those bugs, and you'll find that you can write to VRAM $0800 as you expect.
« Last Edit: March 11, 2016, 04:25:53 AM by elmer »

freem

« Reply #122 on: March 11, 2016, 04:52:39 AM »
Quote
I tried to load your font with pceas and it works fine.

Try this for loading your font:
...
tia gfx4BPP_font , $0002 , $800


I keep forgetting I'm not coding straight 6502 and have access to things like this; thanks :)
Though I did have to change $800 to $800*2 since the VRAM expects word values.
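
(A note on the length operand: tia counts bytes, while each VRAM address holds one 16-bit word, so filling $800 words takes $800*2 = $1000 bytes. A minimal sketch of the adjusted transfer, assuming the font data really is $1000 bytes long and its banks are already mapped as in touko's snippet:)

      st0 #0                              ; select VDC register 0 (MAWR, VRAM write address)
      st1 #<( $800 )                      ; VRAM destination, low byte
      st2 #>( $800 )                      ; VRAM destination, high byte
      st0 #2                              ; select VDC register 2 (VRAM data)
      tia gfx4BPP_font , $0002 , $800*2   ; length is in bytes: $800 words = $1000 bytes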

Quote
Your vram_clearBAT code isn't correct ... you're using Y without initializing it.

aha, so I was getting by on pure luck. ;) I figured that routine was broken somehow, due to being a late night coding exercise.

Quote
Your font loading code isn't correct ... you're using Y without initializing it (because font_loadSize is $0800).

ah, there we go, that would be it. guess I wasn't running the branch code properly in my head :)

Thanks for your help, elmer and touko; this weekend is going to be a lot of fun.

here's the fixed example, in case anyone wants it:
http://www.ajworld.net/pcedev/pce-example01_ca65-fixed.zip
other comments/critique are welcome :)

touko

« Reply #123 on: March 11, 2016, 06:22:23 AM »
Quote
I keep forgetting I'm not coding straight 6502 and have access to things like this; thanks :)
Though I did have to change $800 to $800*2 since the VRAM expects word values.
Of course, $800 was just a test value to see whether your font starts loading correctly at $800 in VRAM.

Quote
other comments/critique are welcome :)
A little tip: there's no need to write the LOW byte to VRAM inside your loop if it's the same every time, because it's buffered.
Write it once before the loop, and write only the HIGH byte inside the loop; it's 2x faster  :wink:
For example, writing 32 words to VRAM:

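   ; assumes the VRAM write address is set and VDC register 2 (VRAM data) is selected
   ldx #32          ; 32 words to write
   lda #low_byte
   sta $0002        ; latch the low byte once; nothing reaches VRAM yet
loop:
   stz $0003        ; high byte = $00; this write commits the whole word
   dex
   bne loop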
« Last Edit: March 11, 2016, 06:30:36 AM by touko »

elmer

« Reply #124 on: March 11, 2016, 08:19:49 AM »
Quote
A little tip: there's no need to write the LOW byte to VRAM inside your loop if it's the same every time, because it's buffered.
Write it once before the loop, and write only the HIGH byte inside the loop; it's 2x faster  :wink:
For example, writing 32 words to VRAM:

   ldx #32
   lda #low_byte
   sta $0002
loop:
   stz $0003
   dex
   bne loop

Hmmm ... I'm not sure if that counts as "very-clever", or "fugly". Either way ... for-gawd's-sake, please put some comments in the code when you do that, or it'll bite you in the ass when you least expect it.

BTW ... has it been confirmed that you get the full-benefit of the trick on real hardware?

I'm curious if the loop is still slow enough to avoid overrunning the CPU-to-VRAM bandwidth and causing cycle stalls from that (the way that we've only just found out about the unexpected TSB/TRB delay).

touko

« Reply #125 on: March 12, 2016, 05:41:58 AM »
Quote
BTW ... has it been confirmed that you get the full-benefit of the trick on real hardware?
Yes, tested on my SGX: the 2 bytes are buffered, and written to VRAM only when you write to $0003.
You cannot update only the low byte in VRAM.

It's very well explained in Charles's doc.
« Last Edit: March 12, 2016, 05:52:04 AM by touko »

elmer

« Reply #126 on: March 12, 2016, 06:09:11 AM »
Quote
BTW ... has it been confirmed that you get the full-benefit of the trick on real hardware?
Yes, tested on my SGX: the 2 bytes are buffered, and written to VRAM only when you write to $0003.
You cannot update only the low byte in VRAM.

It's very well explained in Charles's doc.

I'm not arguing about how the buffering works ... just questioning the actual speed improvement.

For instance ...


       st1 #$00 ; 4+1 cycles
.loop: st2 #$00 ; 4+1 cycles
       st2 #$00 ; 4+1 cycles
       st2 #$00 ; 4+1 cycles
       st2 #$00 ; 4+1 cycles
       st2 #$00 ; 4+1 cycles
       st2 #$00 ; 4+1 cycles
       st2 #$00 ; 4+1 cycles
       st2 #$00 ; 4+1 cycles
       dex
       bne .loop


Will this really clear VRAM 8-words-at-a-time, at 40 cycles per 16 bytes, i.e. 2.5 cycles per byte?
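
(For reference, a self-contained version of that loop with the per-iteration cost tallied, as a sketch only. It assumes the st1/st2 timings quoted above, the usual documented 2 cycles for dex and 4 for a taken bne, a VDC auto-increment of 1, and no extra wait states from the VDC, which is exactly the open question. The start address and the count in X are arbitrary illustration values.)

       st0 #0       ; select VDC register 0 (MAWR, VRAM write address)
       st1 #$00     ; start address low byte (illustration value)
       st2 #$00     ; start address high byte (illustration value)
       st0 #2       ; select VDC register 2 (VRAM data)
       ldx #16      ; hypothetical count: 16 iterations * 8 words = 128 words
       st1 #$00     ; latch the low byte once
.loop: st2 #$00     ; 5 cycles, commits word 1
       st2 #$00     ; 5 cycles, commits word 2
       st2 #$00     ; 5 cycles, commits word 3
       st2 #$00     ; 5 cycles, commits word 4
       st2 #$00     ; 5 cycles, commits word 5
       st2 #$00     ; 5 cycles, commits word 6
       st2 #$00     ; 5 cycles, commits word 7
       st2 #$00     ; 5 cycles, commits word 8
       dex          ; 2 cycles
       bne .loop    ; 4 cycles when taken
       ; per iteration: 8*5 = 40 cycles of writes, plus 2 + 4 of loop overhead,
       ; so 46 cycles per 16 bytes, or about 2.9 cycles per byte

Unrolling further shrinks the share of the dex/bne overhead, so the 2.5 cycles per byte figure is the limit the loop approaches rather than what this exact loop would reach.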

As Bonknuts found out with the TSB/TRB test ... there was an unexpected delay in the turn-around between the read and the write, presumably caused by the VDC having to wait for a CPU read/write slot in the VRAM cycle timings.

touko

« Reply #127 on: March 12, 2016, 10:58:17 PM »
Ah, OK, I understand. Of course it's faster from a dev point of view (and I'm not taking any latency into account here), but I don't know if it's really faster in practice, because there may be some latency between each write. I also don't know whether those latencies exist for stx; I think they are present for lda/sta as well.

The tsb/trb case is different because it reads and writes the same VRAM region (it needs 2 CPU slots), and bonk's tests were in low-resolution mode.
I'll do some tests with the 2 methods to see if there is a difference.

Quote
presumably caused by the VDC having to wait for a CPU read/write slot in the VRAM cycle timings.
I think so, and if it's the case, latency should be done in MED/HIGH res mode.
« Last Edit: March 12, 2016, 11:33:29 PM by touko »

Bonknuts

« Reply #128 on: April 02, 2016, 02:57:43 PM »
The speed penalty in my TRB/TSB test comes from the VDC doing something internal when going from a VRAM read to a VRAM write - hence the delay.

 I didn't encounter any additional stuff for just sequential back-to-back writes (on screen or otherwise). Though technically you could hit an unavailable slot during active display, but that's going to be in partial master clock cycles (/RDY) and not whole instruction cycles. But I never noticed it in the timings of my sequential write tests (the error is probably spread over too many writes to be noticeable).

touko

« Reply #129 on: April 07, 2016, 09:22:56 PM »
Quote
The speed penalty in my TRB/TSB test comes from the VDC doing something internal when going from a VRAM read to a VRAM write - hence the delay.
Yes, TRB/TSB assume there is no delay, because they suppose you are using them on RAM. That's not the case with the VDC: you can only read/write when a CPU slot is available, and this can cause a delay, the largest being at 256 px res.
« Last Edit: April 07, 2016, 09:25:08 PM by touko »

Bonknuts

« Reply #130 on: April 08, 2016, 10:16:21 AM »
Quote
The speed penalty in my TRB/TSB test comes from the VDC doing something internal when going from a VRAM read to a VRAM write - hence the delay.
Yes, TRB/TSB assume there is no delay, because they suppose you are using them on RAM. That's not the case with the VDC: you can only read/write when a CPU slot is available, and this can cause a delay, the largest being at 256 px res.
But that's the point; I don't think it's a CPU slot availability thing with TRB/TSB. I think it's something more. Because I can do a series of st2 or sta on the MSB, or straight read lda on the MSB, and never really see much of that slot offset phase in negative performance. With TRB/TSB, on the MSB, it has something to do with the VDC switching from a VRAM read operation to a VRAM write operation - and not the CPU access slot availability.