OK, I have some surprising results.
First, the method.
A test loop of:
loop:
; put something here
inc counter
bne loop
That's an 11-cycle overhead for the loop. I wait until active display (a flag set during h-int), then I set the timer for two intervals (2048 cycles). On the timer interrupt, I read the "counter" value and store it. Once the loop is over, I show the stored counter value on screen. It's simple. It's not exact because of instruction cycle jitter going into the interrupt, and it might miss a +1 on the counter. But it's close enough for now.
Here are the tests:
TRB $0002 = 103
TRB $0003 = 77
TSB $0002 = 103
TSB $0003 = 77
STA $0002 = 121
STA $0003 = 121
ST0 #00 = 128
ST1 #00 = 128
ST2 #00 = 128
Keep in mind that the counter starts with 1, not zero.
For STA's: (2048/121) - 11 = 5.925 cycles. Pretty close to the 6 cycles expected.
For STx's: (2048/128) - 11 = 5 cycles on the dot! Heh.
TSB/TRB for $0002: (2048/103) - 11 = 8.883 cycles. Close to the 9 that I was expecting.
TSB/TRB for $0003: (2048/77) - 11 = 15.597 cycles! I was NOT expecting that.
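If anyone wants to redo the arithmetic, here's a minimal Python sketch of the same formula. The function and variable names are just mine for illustration. It uses the counter values exactly as listed above and does not try to correct for the counter starting at 1 or for the interrupt jitter.

```python
# Constants taken from the post: the timer window and the per-iteration overhead.
WINDOW_CYCLES = 2048   # two timer intervals
LOOP_OVERHEAD = 11     # inc counter + bne loop

def cycles_per_instruction(counter, window=WINDOW_CYCLES, overhead=LOOP_OVERHEAD):
    """Estimate cycles spent in the instruction under test.

    Each loop iteration costs (instruction + overhead) cycles, so
    window / counter approximates the cost of one full iteration.
    """
    return window / counter - overhead

# Counter readings as reported in the test list above.
results = {
    "TRB/TSB $0002": 103,
    "TRB/TSB $0003": 77,
    "STA $0002/$0003": 121,
    "ST0/ST1/ST2 #00": 128,
}

for name, count in results.items():
    print(f"{name}: {cycles_per_instruction(count):.3f} cycles")
```

Since the counter is a whole number, one count either way moves the result quite a bit at these values (roughly 0.13 to 0.35 cycles), so the fractional parts should be read with that in mind.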
The TRB/TSB at 8.883 is a little bit suspicious. I would have expected the fractional part to be larger. It's possible that the base is 8 cycles (7+1), but that the VDC is using /RDY, which works in master clock delays (not master_clock/3). I ran this test in mednafen and it comes out as 7.96 cycles. Speaking of mednafen, I don't remember offhand which version I'm testing with, but it appears to be running STx opcodes at 4 cycles instead of 5 (4+1). Given the granularity of this test, I would have expected 4.9-something for these STx opcodes. So it's possible that it's really (4+1) plus a fractional stall by the VDC.
But the biggest shocker is TRB/TSB on $0003. Wow. That's 6.714 cycles slower than I expected. I'm guessing that the instruction is hitting the saturation point of the VDC's open access slots, because it's reading from vram, modifying, and writing back. The RMW part is near the end of the instruction, so it's going to be fast. The odd fractional value is probably because of alignment to the 8 dot clock of [CPU BAT CPU CPU CG0 CPU CG1]. I should probably test this during vblank to see if it's faster.
Update #2:
I did another two tests:
lda $0002; sta $0002
and
lda $0003; sta $0003
The read/store of $0002 is 12.01 cycles for the pair. It should be something like 11.9xx, so there seems to be a fractional delay. But the read/store of $0003 is exactly the same as TxB of $0003 = 15.597. It's exact. So there's definitely some sort of delay when immediately switching from reading vram to writing vram on the VDC side.
Update #3:
LDA $0003; AND #12; ORA #34; STA $0003
A total of 15.947 cycles for those four instructions. That's pretty much right on the money.