I fixed a bug in the driver that affected channels 1-3 when they have looping points: the update for those channels was based on channel 0's data, which could send the interrupt routine into an infinite loop. I'm gonna move the update processor inside the last TIRQ call, so timing won't be a problem. The Vblank interrupt still resyncs the TIMER though. I also found that XM/MOD files tend to have the looping data inside the wavefile itself ("smpl<"), so I'm gonna update the wav2sixbit util to use this. I'll release those updates soon.
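For anyone who wants to poke at the loop data themselves, here's a rough Python sketch of pulling loop points out of a WAV file's "smpl" chunk. This is not the wav2sixbit code; the function name find_smpl_loops is made up for illustration, and it assumes the standard RIFF "smpl" layout (36 byte header, then one 24 byte record per loop). A chunk with a single loop is 60 bytes, and 60 is 0x3C, the ASCII code for "<", which may be why the tag shows up as "smpl<" in a hex viewer.

```python
import struct

def find_smpl_loops(wav_bytes):
    """Return a list of (start, end) loop points from a WAV 'smpl' chunk.

    Hypothetical helper for illustration only; returns [] if no chunk is found.
    """
    if wav_bytes[0:4] != b"RIFF" or wav_bytes[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos = 12
    while pos + 8 <= len(wav_bytes):
        chunk_id, size = struct.unpack_from("<4sI", wav_bytes, pos)
        body = pos + 8
        if chunk_id == b"smpl":
            # Loop count is at offset 28 of the chunk body; records start at 36.
            (num_loops,) = struct.unpack_from("<I", wav_bytes, body + 28)
            loops = []
            for i in range(num_loops):
                rec = body + 36 + i * 24
                # Record fields: cue id, type, start, end, fraction, play count.
                _cue, _type, start, end, _frac, _count = struct.unpack_from(
                    "<6I", wav_bytes, rec)
                loops.append((start, end))
            return loops
        pos = body + size + (size & 1)  # chunks are padded to an even length
    return []
```

The start and end values are sample offsets into the wave data, so a converter would still have to rescale them if it resamples the wave.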
I'm gonna make an official thread for the PCM drivers (yes, there will be multiple versions), which I'll update as they mature. So this thread is still for discussing WIP PCM stuff/concepts.
As of now, since the first PCM driver incarnation seems to be working quite well, I'm gonna start on the 15.6kHz version. The approach is a little different. Here's a basic driver setup:
;call ;8
jmp [pointer] ;7
active.part.one
pha ;3
phy ;3
lda $0000 ;6
.RCR
;// RCR routine
inc <IndexRCR ;6
bne .change.active.part ;2
st0 #RCR ;5
sty $0002 ;6
lda <VDC_REG ;4
sta $0000 ;6
; = 29
;42 cycles for overhead (includes INT call itself)
;29+42 = 71 * 262 = 18,602 or 15.6% cpu
.PCMDriver
dec <DriverCnt ;6
beq .reload ;2
.re-entry
ldy <Bffr_IY ;4
inc <Bffr_IY ;6
;18
stz $800
lda buffer1,y
sta $806
lda #$01
sta $800
lda buffer2,y
sta $806
lda #$02
sta $800
lda buffer3,y
sta $806
lda #$03
sta $800
lda buffer4,y
sta $806
; 66 cycles
; 66+18 = 84
; @256 times = 21504 cycles or 18% cpu resource
; Note: 21.5% for 6 channels.
ply ;4
pla ;4
rti ;7
.reload
dec <CounterTableIY
bpl .NoResetIY
lda #$5
sta <CounterTableIY
.NoResetIY
ldy <CounterTableIY
lda .CounterTable,y
sta <DriverCnt
jmp .re-entry
.CounterTable
.db 43,44,44,43,44,44
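A quick sanity check on the counter table, as a Python sketch. The six entries add up to 262, the same scanline count used in the cycle comments, so the .reload path fires six times per frame at nearly even spacing. My reading (an assumption, the listing doesn't say so) is that those six lines are the ones set aside to get from 262 interrupts down to the 256 samples per frame mentioned in the "@256 times" comment. The helper name reload_lines is made up, and it assumes DriverCnt starts out loaded from the last table entry.

```python
COUNTER_TABLE = [43, 44, 44, 43, 44, 44]  # values from .CounterTable
LINES_PER_FRAME = 262                      # scanline count from the cycle comments

def reload_lines(table):
    """Return the interrupt numbers (1-based) on which DriverCnt reaches zero.

    The driver walks the table with a descending index, so iterate in reverse.
    """
    lines = []
    line = 0
    for count in reversed(table):
        line += count
        lines.append(line)
    return lines

events = reload_lines(COUNTER_TABLE)
print("reload on interrupts:", events)
print("interrupts left over per frame:", LINES_PER_FRAME - len(events))
```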
Here's a version with an HDMA-style system for doing Hint FX.
;call ;8
jmp [pointer] ;7
active.part.one
pha ;3
stz $402
stz $403
lda <BG0.l
sta $404
lda <BG0.h
sta $405
phy ;3
lda $0000 ;6
ldy <IndexRCR
lda Dolist,y
beq .skip
;// X scroll
bit #$01
beq .next1
st0 #$07
lda Xoffset,y
sta $0002
.next1 ;// Y scroll
bit #$02
beq .next2
st0 #$08
lda Yoffset,y
sta $0002
.next2 ;// Sprite off/on, BG off/on
bit #$04
beq .next3
st0 #$05
lda VDC_disp,y
sta $0002
.next3 ;// BG color #0
bit #$08
beq .skip
lda BG0.lsb,y
sta <BG0.l
lda BG0.msb,y
sta <BG0.h
.skip
.RCR
;// RCR routine
inc <IndexRCR ;6
bne .change.active.part ;2
st0 #RCR ;5
sty $0002 ;6
lda <VDC_REG ;4
sta $0000 ;6
; = 29
;42 cycles for overhead (includes INT call itself)
;29+42 = 71 * 262 = 18,602 or 15.6% cpu
.PCMDriver
dec <DriverCnt ;6
beq .reload ;2
.re-entry
ldy <Bffr_IY ;4
inc <Bffr_IY ;6
;18
stz $800
lda buffer1,y
sta $806
lda #$01
sta $800
lda buffer2,y
sta $806
lda #$02
sta $800
lda buffer3,y
sta $806
lda #$03
sta $800
lda buffer4,y
sta $806
; 66 cycles
; 66+18 = 84
; @256 times = 21504 cycles or 18% cpu resource
; Note: 21.5% for 6 channels.
ply ;4
pla ;4
rti ;7
.reload
dec <CounterTableIY
bpl .NoResetIY
lda #$5
sta <CounterTableIY
.NoResetIY
ldy <CounterTableIY
lda .CounterTable,y
sta <DriverCnt
jmp .re-entry
.CounterTable
.db 43,44,44,43,44,44
^- Just one example. With the indirect jump, you can do all kinds of setups.
So the base setup, no Hint FX, takes about 30% cpu to play 4 channels at 15.6kHz. Of course, this doesn't include the frequency scaling. That's all done outside this routine. This is just one of many types of 15.6kHz setups.
If you notice, the channel select operations take up some time. That's unfortunate, as they really didn't need to implement that type of system (they could have mapped all registers to their own unique addresses). But what it does mean is that doing a 10bit paired channel output won't take much more resource (probably nearly the same). It just wouldn't be stereo, but you'd still have 4 regular PCE channels that are stereo.
The same setup, but with 10bit paired output, would initially save 34 cycles of overhead per sample (two channel writes). But software volume costs +5 cycles per channel and mixing costs +7. So that's 48 cycles of total overhead for the soft channels (10bit method) against 34 cycles saved: +14 cycles, or 3% more cpu resource. Mono, but 4 frequency scaled channels at a higher bit depth (6 bits per channel, or 7 bits with a little more resource).
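Here's the 10bit arithmetic as a small Python sketch, in case anyone wants to plug in their own numbers. Assumptions on my part: a 7.16MHz cpu clock at 60 frames per second, 256 samples per frame as in the listing comments, and 17 cycles per extra channel write (inferred from the 66 cycle total, which works out to 15 for channel 0 plus 3 x 17). The function name is made up for illustration.

```python
CPU_HZ = 7_159_090        # assumed HuC6280 clock in high speed mode
FRAMES_PER_SEC = 60       # assumed nominal frame rate
SAMPLES_PER_FRAME = 256   # "@256 times" from the listing comments

def paired_10bit_extra_cost(channels=4, write_cycles=17,
                            volume_cycles=5, mix_cycles=7):
    """Return (net cycles per sample, percent of cpu) for the 10bit paired method."""
    saved = 2 * write_cycles                        # two fewer channel writes
    added = channels * (volume_cycles + mix_cycles)  # soft volume + mixing
    net = added - saved
    cycles_per_frame = CPU_HZ / FRAMES_PER_SEC
    percent = 100.0 * net * SAMPLES_PER_FRAME / cycles_per_frame
    return net, percent

net, percent = paired_10bit_extra_cost()
print(f"net cycles per sample: {net}, extra cpu: {percent:.1f}%")
```

With the default numbers this lands on the same +14 cycles and roughly 3% quoted above.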
Anyway, so 30% for the "driver" part. If I can frequency scale four channels in 20% cpu or less (which should be doable), then I'll have hit my mark: 50% total cpu resource to play back 4 channels at 15.6kHz. The quality should jump up dramatically. On the real system, there should be even more of a filter effect too, since the playback rate will be higher. And on some emulators, it won't sound so good (they don't expect that high a DDA playback rate, and so will miss sample writes, etc). Ootake had a crappy artifact in DDA writes even at 7kHz. Did they fix that? As per usual, mednafen shouldn't have a problem.