Author Topic: PCE PCM (Read 10679 times)

elmer · « **Reply #30 on:** December 18, 2016, 05:07:30 AM »

Quote from: Bonknuts on December 18, 2016, 03:46:17 AM

There are certain things I want to document, such as the volume regs not taking updates immediately (they have a frequency range of something like 2khz - from testing with mednafen author). There's also a filtering effect starting around.. IIRC 6khz, which actually makes sample streaming sound a little better on the real system than emulators. Stuff like that.

The more information that's available, the better!

BTW ... can you tell me if accessing the PSG registers has the same 1-cycle-extra delay as the VDC registers?

In my new driver, I'm using a TIN instruction to update a channel's waveform, and would like to know if that's going to cause a 17+32*6 or 17+32*7 interrupt delay (it makes a difference, because I need to specifically disable interrupts during the update).
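For anyone following along, the two candidate figures are just the block-transfer cost of 17 cycles of setup plus a per-byte cost. Here is a quick Python sketch of that arithmetic; the function name is made up for illustration, and whether the PSG write costs 6 or 7 cycles per byte is exactly the open question, which this does not settle.

```python
def block_transfer_cycles(length, cycles_per_byte):
    """Cycles for one Txx block transfer: 17 setup + per-byte cost."""
    return 17 + length * cycles_per_byte

# 32-byte waveform, without and with the extra I/O cycle per byte.
normal = block_transfer_cycles(32, 6)
delayed = block_transfer_cycles(32, 7)
print(normal, delayed, delayed - normal)
```

So the difference at stake is 32 cycles of interrupt latency per waveform update.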

Bonknuts · « **Reply #31 on:** December 18, 2016, 06:05:44 AM »

From memory, anything from $000-7ff range in bank $FF has the extra cycle delay (no matter where it's mapped). So from memory, yes 17+32*7. I'll see if I can retest it today to verify.

elmer · « **Reply #32 on:** December 18, 2016, 06:16:43 AM »

Quote from: Bonknuts on December 18, 2016, 06:05:44 AM

From memory, anything from $000-7ff range in bank $FF has the extra cycle delay (no matter where it's mapped). So from memory, yes 17+32*7. I'll see if I can retest it today to verify.

OK, thanks!

If so, then I'll have to change the cycle-timings for the sample-playback interrupts, too.

<EDIT>

Hold on a second ... the PSG is at $0800, and reading Charles MacDonald's pcetech.txt seems to suggest that the extra-cycle only applies to the VDC and VCE ($0000-$07FF).

I *may* be OK.

elmer · « **Reply #33 on:** December 18, 2016, 07:42:00 AM »

Here is what the timings are looking like at-the-moment.

Writing to VDC ...

                tia     $0000,VDC_DATA,32       ; 241 Self-modifying TIN instruction.

Delay 241 cycles.

Writing to PSG ...

mz_update_wave: sei                             ; 2 No interrupts while writing PSG.
                stx     PSG_R0                  ; 5 Select PSG hardware channel.
                sta     PSG_R4                  ; 5 Reset this channel's read/write
                stz     PSG_R4                  ; 5 address.
                jsr     mz_tin                  ; 7 Transfer waveform & enable IRQ.
                ...

mz_tin:         tin     $0000,PSG_R6,32         ; 209 Self-modifying TIN instruction.
                cli                             ; 2 Allow an hsync/timer IRQ to run.
                rts

Delay 235 cycles.

Additional sample-playback delay ...

tirq_ch234:     ;;;                             ; 8 (cycles for the INT)
                stz     $1403                   ; 5 Acknowledge TIMER IRQ.
                cli                             ; 2 Allow HSYNC to interrupt.

Delay 15 cycles.

Maximum hsync delay 256 cycles.
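The totals in that listing can be re-added in a few lines of Python. The cycle counts are copied from the comments; treating the maximum hsync delay as "longest uninterruptible stretch plus the timer IRQ entry" is one reading of the numbers, not something stated outright, though it does land on the same 256.

```python
# Cycle counts copied from the listing's comments.
vdc_write = 241                              # tia $0000,VDC_DATA,32
psg_write = 2 + 5 + 5 + 5 + 7 + 209 + 2      # sei ... tin ... cli
timer_entry = 8 + 5 + 2                      # INT + stz $1403 + cli

# Assumption: worst case is the longest blocked stretch followed
# immediately by the timer IRQ's own entry sequence.
max_hsync_delay = max(vdc_write, psg_write) + timer_entry
print(psg_write, timer_entry, max_hsync_delay)
```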

********************

And bonknuts (or anyone else), is there some reason why the System Card IRQ1 handler delays processing the hsync interrupt by checking for a vsync interrupt first, and then delays things even more by doing 2 dummy BSR+RTS calls before changing a VDC register?

Has anyone looked at the timing of the hsync interrupt in relation to the VDC's latching of the scroll registers for the next display line?

Bonknuts · « **Reply #34 on:** December 18, 2016, 07:54:35 AM »

Hahaha. Sorry, I was thinking of the VCE for some reason :oops:

Bonknuts · « **Reply #35 on:** December 18, 2016, 08:02:51 AM »

About the sys card VDC handling routine; I dunno. But I never use it. I opt to use custom handling myself for everything VDC related (which is one jmp indirection, and.. maybe one BBx involved?). Or just straight out replacing the bank in MPR7 with something of my own.

Yeah, checking vsync first is totally ass backwards. The VDC interrupt handler should be optimized for the hsync routine - who cares about vsync and whatever small delay it gets. But then again, nothing in the sys card lib is really "optimal".

touko · « **Reply #36 on:** December 20, 2016, 09:23:54 PM »

I think if you want to decrease the CPU load when you're playing samples, a little buffer can do the job very well.
It's the banking which takes the most cycles, and reducing it to 1 mapping for 4 samples (4 bytes), for example, helps a lot.

Bonknuts · « **Reply #37 on:** December 21, 2016, 04:22:06 AM »

Quote

and reduce it to 1 mapping for 4 samples(4 bytes)

If switched to a buffer system, there would be no mapping (the buffer should be in fixed system ram).

Doing a buffer system is faster, but it also has some requirements. It's going to require two buffers in ram for all channels; the timer is 1024 cycles between interrupts - you're going to copy 4x116 bytes in 1024 cycles? Not gonna happen. Even just one channel gets too close for comfort (713 cycles via Txx).
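To put numbers on that, here is a small Python sketch of the copy cost. It assumes one Txx transfer per channel at 17 + 6 cycles per byte (RAM to RAM, no I/O penalty); the names are illustrative only.

```python
TIMER_PERIOD = 1024      # CPU cycles between timer interrupts
BUFFER_BYTES = 116       # one frame of samples per channel

def txx_cycles(length):
    """Cost of a single Txx block transfer of `length` bytes."""
    return 17 + 6 * length

one_channel = txx_cycles(BUFFER_BYTES)
four_channels = 4 * one_channel
print(one_channel, four_channels, four_channels > TIMER_PERIOD)
```

One channel already eats most of a timer period in a single uninterruptible instruction, and four channels is nearly three periods.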

It's not just the bank mapping that the buffer system reduces. There's no MSB check on the buffer inside the TIRQ routine. Though that only saves you +2 cycles per sample, per channel. You could remove the EOF marker, and simply have all samples trail out zeros or $0f - both work (any value works, actually). So there's another +2 cycles per sample per channel saved.

Don't get me wrong; I use the double buffer system for my own stuff. But sometimes it's easier when you give other people functionality - to keep the interface a little more simple, and just eat a little overhead.

For a single channel buffer system; you'd save ~1.8% cpu overhead. For two channel buffer, you save ~2.2%. For four channel buffer, you save ~2.7%. It's not a whole lot. The reason being, is that mapping in a channel is only 9 cycles (lda <zp: tam #$nn). The larger overhead is from the tma #n:pha and pla:tam for saving the MPR. That's 16 cycles overhead, but that overhead basically gets divided down as more channels are output inside the routine. So the biggest cost savings is single channel use, relative to per channel savings.

Maybe I should be more clear; if you have 4 samples to stream - you don't map them into 4 individual banks. There's no reason to. You map them in sequential order, to the same MPR reg, as you use them. That way you only need to save/restore one bank for <n> number of channels to stream from. My above overhead savings assumes this. If it didn't, then you'd take the 1.8% and multiply that by the number of channels used as your total savings. But that shouldn't be the case.

What I do like about the buffer system, over the slight savings, is the flexibility of it. You can support both compressed samples and uncompressed samples. You could also support half frequency samples (3.5khz instead of 7khz; some sound FX actually sound decent at this playback rate. There are some PCE games that do this; playback samples at both rates). Do all kinds of stuff, and the main TIMER routine wouldn't have to know anything, other than what's in the buffer.

elmer · « **Reply #38 on:** December 21, 2016, 04:54:15 AM »

Quote from: touko on December 20, 2016, 09:23:54 PM

I think if you want to decrease the CPU load when you're playing samples, a little buffer can do the job very well.
It's the banking which takes the most cycles, and reducing it to 1 mapping for 4 samples (4 bytes), for example, helps a lot.

That's an interesting idea, there does seem to be quite a few cycles spent in the banking.

I'd like to have seen an example of how you'd actually accomplish that in practice.

You'd be adding some overhead in creating that buffer every 4th interrupt (and some extra instructions in *not* creating it for the other 3 interrupts).

So it's all going to be in the details, and in how you ensure that you don't keep interrupts disabled for too long.

Here's an example that I came up with that shows the *maximum* benefit that you could obtain by dropping *all* the banking from the interrupts, and just buffering up an entire frame's worth of samples in three 116-sample buffers in RAM.

; Three Channel Sample Playback.
;
; Time (normal 0 channel):  71 * 116 calls =  8236 cycles (6.91%)
; Time (normal 1 channel): 107 * 116 calls = 12412 cycles (10.41%)
; Time (normal 2 channel): 143 * 116 calls = 16588 cycles (13.91%)
; Time (normal 3 channel): 179 * 116 calls = 20764 cycles (17.42%)
; Time (worst  3 channel): 179 * 115 calls +
;                          251 *   1 calls = 20836 cycles (17.48%)

; Three Channel Sample Playback with RAM buffer.
;
; Time (normal 0 channel):  55 * 116 calls =  6380 cycles (5.35%)
; Time (normal 1 channel):  80 * 116 calls =  9280 cycles (7.78%)
; Time (normal 2 channel): 105 * 116 calls = 12180 cycles (10.22%)
; Time (normal 3 channel): 130 * 116 calls = 15080 cycles (12.65%)
; Time (worst  3 channel): 130 * 115 calls +
;                          151 *   1 calls = 15101 cycles (12.67%)

OK, here's the first part, and the gain looks good!

But you've then got to add the overhead for creating the RAM buffers.

When I do that, with the fastest TII code that I can think of, I get ...

; Three Channel Sample Playback with creating RAM buffer.
;
; Time (normal 0 channel):  6380 +   63 cycles =  6443 (5.40%)
; Time (normal 1 channel):  9280 +  924 cycles = 10204 (8.56%)
; Time (normal 2 channel): 12180 + 1785 cycles = 13965 (11.71%)
; Time (normal 3 channel): 15080 + 2646 cycles = 17726 (14.87%)
; Time (worst  3 channel): 15101 + 2646 cycles = 17747 (14.89%)

That's a 2.6% frame-time improvement at *best*, and I've not dealt with the issue of how to create those buffers safely without delaying the timer interrupt and causing an audio problem.
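The 2.6% can be re-derived from the worst-case rows of the tables. In this Python sketch the frame length of 262 * 455 cycles is an assumption, chosen because it reproduces the percentages quoted in the tables; the cycle totals themselves are copied from them.

```python
FRAME_CYCLES = 262 * 455   # assumption: 119210 CPU cycles per frame

def frame_percent(cycles):
    """Share of one frame, in percent, rounded like the tables."""
    return round(100.0 * cycles / FRAME_CYCLES, 2)

banked_worst = 179 * 115 + 251 * 1        # banked playback, 3 channels
buffered_worst = 130 * 115 + 151 * 1      # RAM-buffer playback, 3 channels
buffered_total = buffered_worst + 2646    # plus building the buffers

saving = frame_percent(banked_worst) - frame_percent(buffered_total)
print(frame_percent(banked_worst), frame_percent(buffered_total), round(saving, 2))
```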

I'm not sure (yet) that the benefit is worth the cost.

I'd love to see what you can come up with!

Here's my code ...

Code:

;****************************************************************************
;
; Three Channel Sample Playback with RAM buffer.
;
; Time (normal 0 channel):  55 * 116 calls =  6380 cycles (5.35%)
; Time (normal 1 channel):  80 * 116 calls =  9280 cycles (7.78%)
; Time (normal 2 channel): 105 * 116 calls = 12180 cycles (10.22%)
; Time (normal 3 channel): 130 * 116 calls = 15080 cycles (12.65%)
; Time (worst  3 channel): 130 * 115 calls +
;                          151 *   1 calls = 15101 cycles (12.67%)
;
; Maximum hsync delay:     151 - 20 = 131 cycles

tirq_ch234:     ;;;                             ; 8 (cycles for the INT)
                stz     $1403                   ; 5 Acknowledge TIMER IRQ.
                cli                             ; 2 Allow HSYNC to interrupt.
                pha                             ; 3
                sei                             ; 2 Disable interrupts.

.channel2:      bbr2    <sample_flag,.channel3  ; 6
                lda     #2                      ; 2
                sta     PSG_R0                  ; 5
                lda     [sample2_ptr]           ; 7
                bmi     .eof2                   ; 2
                sta     PSG_R6                  ; 5
                inc     <sample2_ptr            ; 6

.channel3:      bbr3    <sample_flag,.channel4  ; 6
                lda     #3                      ; 2
                sta     PSG_R0                  ; 5
                lda     [sample3_ptr]           ; 7
                bmi     .eof3                   ; 2
                sta     PSG_R6                  ; 5
                inc     <sample3_ptr            ; 6

.channel4:      bbr4    <sample_flag,.done      ; 6
                lda     #4                      ; 2
                sta     PSG_R0                  ; 5
                lda     [sample4_ptr]           ; 7
                bmi     .eof4                   ; 2
                sta     PSG_R6                  ; 5
                inc     <sample4_ptr            ; 6

.done:          pla                             ; 4
                rti                             ; 7

.eof2:          stz     PSG_R4                  ; 5
                rmb2    <sample_flag            ; 7
                bra     .channel3               ; 4

.eof3:          stz     PSG_R4                  ; 5
                rmb3    <sample_flag            ; 7
                bra     .channel4               ; 4

.eof4:          stz     PSG_R4                  ; 5
                rmb4    <sample_flag            ; 7
                bra     .done                   ; 4


;****************************************************************************
;
; Three Channel Sample Playback with creating RAM buffer.
;
; Time (normal 0 channel):  6380 +   63 cycles =  6443 (5.40%)
; Time (normal 1 channel):  9280 +  924 cycles = 10204 (8.56%)
; Time (normal 2 channel): 12180 + 1785 cycles = 13965 (11.71%)
; Time (normal 3 channel): 15080 + 2646 cycles = 17726 (14.87%)
; Time (worst  3 channel): 15101 + 2646 cycles = 17747 (14.89%)
;

buffer_samples: tma3                            ; 4
                pha                             ; 3
                tma4                            ; 4
                pha                             ; 3

; Prepare channel 2's sample 116-byte buffer.

.channel2:      bbr2    <sample_flag,.channel3  ; 6

                lda     s2_bnk                  ; 5
                tam3                            ; 3
                inc     a                       ; 2
                tam4                            ; 3
.smod0:         tii     s2_ptr,s2_buf+$00,32    ; 209
.smod1:         tii     s2_ptr,s2_buf+$20,32    ; 209
.smod2:         tii     s2_ptr,s2_buf+$40,32    ; 209
.smod3:         tii     s2_ptr,s2_buf+$60,20    ; 137
                lda     smod0+1                 ; 5   lo
                ldy     smod0+2                 ; 5   hi

                clc                             ; 2
                adc     #116                    ; 2
                sta     smod0+1                 ; 5   lo
                bcc     .addr0                  ; 2
                iny                             ; 2
                bpl     .addr0                  ; 4
                inc     s2_bnk                  ; -
                ldy     #$60                    ; -

.addr0:         sty     smod0+2                 ; 5    hi
                clc                             ; 2
                adc     #32                     ; 2
                sta     smod1+1                 ; 5    lo
                bcc     .addr1                  ; 2
                iny                             ; 2
.addr1:         sty     smod1+2                 ; 5    hi

                clc                             ; 2
                adc     #32                     ; 2
                sta     smod2+1                 ; 5    lo
                bcc     .addr2                  ; 2
                iny                             ; 2
.addr2:         sty     smod2+2                 ; 5    hi

                clc                             ; 2
                adc     #32                     ; 2
                sta     smod3+1                 ; 5    lo
                bcc     .addr3                  ; 2
                iny                             ; 2
.addr3:         sty     smod3+2                 ; 5    hi

; Prepare channel 3's sample 116-byte buffer.

.channel3:      bbr3    <sample_flag,.channel4  ; 6
                ...

; Prepare channel 4's sample 116-byte buffer.

.channel4:      bbr4    <sample_flag,.done      ; 6
                ...

.done:          pla                             ; 4
                tam4                            ; 5
                pla                             ; 4
                tam3                            ; 5
                rts                             ; 7

elmer · « **Reply #39 on:** December 21, 2016, 05:25:00 AM »

Quote from: Bonknuts on December 21, 2016, 04:22:06 AM

Doing a buffer system is faster, but it also has some requirements. It's going to require two buffers in ram for all channels; the timer is 1024 cycles between interrupts - you're going to copy 4x116 bytes in 1024 cycles? Not gonna happen.

Yeah, I was trying to avoid the double-buffer, but I'm not sure that I can easily do so.

You *could* interleave the buffer updates, i.e. update the 1st 16-bytes of each channel within a single TIRQ time-period, and then update the rest ... but your code is getting *excessively* timing-dependent at that point.

Quote

Don't get me wrong; I use the double buffer system for my own stuff. But sometimes it's easier when you give other people functionality - to keep the interface a little more simple, and just eat a little overhead.

Ahhh ... OK, so that's why you're so keen on keeping a consistent 116 interrupts-per-frame and resyncing the timer every vsync!

The simple code doesn't really care whether there are 116 or 117 interrupts in a frame, or the exact synchronization.
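The "116 or 117" comes from the timer period not dividing evenly into a frame. A rough Python check, assuming a 262 * 455 cycle frame (a guess that matches the percentages in the earlier tables) and the usual 7.15909 MHz CPU clock:

```python
FRAME_CYCLES = 262 * 455     # assumption: CPU cycles per frame
TIMER_PERIOD = 1024          # CPU cycles between timer interrupts
CPU_HZ = 7159090             # assumption: nominal HuC6280 clock

per_frame = FRAME_CYCLES / TIMER_PERIOD
sample_rate = CPU_HZ / TIMER_PERIOD
print(round(per_frame, 2), round(sample_rate))
```

A free-running timer therefore fires 116 times in most frames and 117 in some, which is why a fixed 116-byte frame buffer needs the vsync resync and the plain streaming code does not.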

Yeah ... the more that I think about it, if the target is a generic sound driver that could be used in HuC as a replacement for the System Card Player, then I'd prefer to keep things simple-but-reliable, and accept the 2..3% CPU hit.

Quote

What I do like about the buffer system, over the slight savings, is the flexibility of it. You can support both compressed samples and uncompressed samples. You could also support half frequency samples (3.5khz instead of 7khz; some sound FX actually sound decent at this playback rate. There are some PCE games that do this; playback samples at both rates). Do all kinds of stuff, and the main TIMER routine wouldn't have to know anything, other than what's in the buffer.

All good points ... but I'll leave that for the "advanced" developers like you!

touko · « **Reply #40 on:** December 21, 2016, 05:33:13 AM »

No need for a big buffer; a 4-byte buffer per voice is enough, and you need to map the data only once for 4 samples.
Banking the data each time is 50/60 cycles per sample, so it's 100/120 cycles per sample for 2 voices.
For 8 samples (4 samples/voice) you lose 400/480 cycles vs only 100/120 with banking.
You can drastically reduce the CPU load in your frame.

You can also do bit packing to reduce the need for mapping (and also reduce the sample size by 1/3): 3 samples in 2 bytes.

elmer · « **Reply #41 on:** December 21, 2016, 06:11:36 AM »

Quote from: touko on December 21, 2016, 05:33:13 AM

No need for a big buffer; a 4-byte buffer per voice is enough, and you need to map the data only once for 4 samples.

Show a code example, please.

Quote

Banking the data each time is 50/60 cycles per sample, so it's 100/120 cycles per sample for 2 voices.
For 8 samples (4 samples/voice) you lose 400/480 cycles vs only 100/120 with banking.
You can drastically reduce the CPU load in your frame.

You're not making any sense.

Please show some code example of why you think this is so.

The code that I posted earlier has a banking overhead of ...

1 channel sample playback = 27 cycles per timer interrupt
2 channel sample playback = 38 cycles per timer interrupt
3 channel sample playback = 49 cycles per timer interrupt

And 3 channels is the maximum before I'd have to re-enable interrupts or risk delaying an hsync too much.
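Those three figures can be rebuilt from the per-instruction cycle counts in the banked routine. This Python sketch assumes that "banking" means the MPR3 save/restore plus, per channel, the bank load, the tam3 and the page-cross test; it then cross-checks against the per-call totals of the banked and RAM-buffer versions.

```python
SAVE_RESTORE = 4 + 3 + 4 + 5   # tma3, pha ... pla, tam3
PER_CHANNEL = 4 + 5 + 2        # lda <sampleN_bnk, tam3, beq .msbN

def banking_overhead(channels):
    """Cycles per timer interrupt spent purely on banking."""
    return SAVE_RESTORE + channels * PER_CHANNEL

overheads = [banking_overhead(n) for n in (1, 2, 3)]

# Cross-check: per-call cost with banking minus per-call cost from RAM.
banked = [107, 143, 179]
buffered = [80, 105, 130]
differences = [b - r for b, r in zip(banked, buffered)]
print(overheads, differences)
```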

Are you seeing something wrong in the code that I posted?

Quote

You can also do bit packing to reduce the need for mapping (and also reduce the sample size by 1/3): 3 samples in 2 bytes.

Yes, you can, at the cost of more overhead, and more cycles.

Again ... please show a code example of how you're doing all of this without overhead, or provide some timing calculations to show the cost.

<edit>

OK, my code was actually in the MML thread, so here's the latest 3 channel version for reference ...

;****************************************************************************
;
; Three Channel Sample Playback.
;
; Time (normal 0 channel):  71 * 116 calls =  8236 cycles (6.91%)
; Time (normal 1 channel): 107 * 116 calls = 12412 cycles (10.41%)
; Time (normal 2 channel): 143 * 116 calls = 16588 cycles (13.91%)
; Time (normal 3 channel): 179 * 116 calls = 20764 cycles (17.42%)
; Time (worst  3 channel): 179 * 115 calls +
;                          251 *   1 calls = 20836 cycles (17.48%)
;
; Maximum hsync delay:     251 - 25 = 226 cycles

tirq_ch234:     ;;;                             ; 8 (cycles for the INT)
                stz     $1403                   ; 5 Acknowledge TIMER IRQ.
                cli                             ; 2 Allow HSYNC to interrupt.
                pha                             ; 3
                tma3                            ; 4
                pha                             ; 3
                sei                             ; 2 Disable interrupts.

.channel2:      bbr2    <sample_flag,.channel3  ; 6
                lda     <sample2_bnk            ; 4
                tam3                            ; 5
                lda     #2                      ; 2
                sta     PSG_R0                  ; 5
                lda     [sample2_ptr]           ; 7
                bmi     .eof2                   ; 2
                sta     PSG_R6                  ; 5
                inc     <sample2_ptr            ; 6
                beq     .msb2                   ; 2

.channel3:      bbr3    <sample_flag,.channel4  ; 6
                lda     <sample3_bnk            ; 4
                tam3                            ; 5
                lda     #3                      ; 2
                sta     PSG_R0                  ; 5
                lda     [sample3_ptr]           ; 7
                bmi     .eof3                   ; 2
                sta     PSG_R6                  ; 5
                inc     <sample3_ptr            ; 6
                beq     .msb3                   ; 2

.channel4:      bbr4    <sample_flag,.done      ; 6
                lda     <sample4_bnk            ; 4
                tam3                            ; 5
                lda     #4                      ; 2
                sta     PSG_R0                  ; 5
                lda     [sample4_ptr]           ; 7
                bmi     .eof4                   ; 2
                sta     PSG_R6                  ; 5
                inc     <sample4_ptr            ; 6
                beq     .msb4                   ; 2

.done:          pla                             ; 4
                tam3                            ; 5
                pla                             ; 4
                rti                             ; 7

.msb2:          inc     <sample2_ptr+1          ; 6
                bpl     .channel3               ; 2
                inc     <sample2_bnk            ; 6
                lda     #$60                    ; 2
                sta     <sample2_ptr+1          ; 4
                bra     .channel3               ; 4

.msb3:          inc     <sample3_ptr+1          ; 6
                bpl     .channel4               ; 2
                inc     <sample3_bnk            ; 6
                lda     #$60                    ; 2
                sta     <sample3_ptr+1          ; 4
                bra     .channel4               ; 4

.msb4:          inc     <sample4_ptr+1          ; 6
                bpl     .done                   ; 2
                inc     <sample4_bnk            ; 6
                lda     #$60                    ; 2
                sta     <sample4_ptr+1          ; 4
                bra     .done                   ; 4

.eof2:          stz     PSG_R4                  ; 5
                rmb2    <sample_flag            ; 7
                bra     .channel3               ; 4

.eof3:          stz     PSG_R4                  ; 5
                rmb3    <sample_flag            ; 7
                bra     .channel4               ; 4

.eof4:          stz     PSG_R4                  ; 5
                rmb4    <sample_flag            ; 7
                bra     .done                   ; 4

Bonknuts · « **Reply #42 on:** December 21, 2016, 06:26:52 AM »

Ahh ok - I think I know what Touko is talking about now. Touko can you post your code example?

Bonknuts · « **Reply #43 on:** December 21, 2016, 06:39:23 AM »

I forgot how much better the PCE/SGX sounds through a stereo system. So much more bass-y-er and less tinny than emulation through TV or even earphones on the laptop. And the analog filtering makes it a bit softer on the real system too. I wish emulators could emulate that.

Anyway, here's my batch of 7khz sample scaling vs 14khz sample scaling. On the real system, the 14khz performs better than on emulators thanks to the analog filtering. It's still not a big difference, or as much as I expected, going with double the frequency. But there is more 'punch' to some of the samples. Or at least on my stereo system. http://www.pcedev.net/HuPCMDriver/7khz_and_14khz.zip <- try them out on the real system (not emulator).

elmer · « **Reply #44 on:** December 21, 2016, 09:37:55 AM »

Quote from: Bonknuts on December 21, 2016, 06:26:52 AM

Ahh ok - I think I know what Touko is talking about now. Touko can you post your code example?

Are you thinking that touko is talking about your multichannel PCM driver?

The one that you've said takes 12% CPU to mix 8 PCM channels into 2 PSG channels?

http://www.pcenginefx.com/forums/index.php?topic=20035.msg464140#msg464140

Now that I've seen the cost of 1/2/3 PSG sample-channels, something like that starts to sound quite tempting!
