Author Topic: PCE PCM (Read 10729 times)

esteban · « **Reply #45 on:** December 21, 2016, 02:02:55 PM »

Quote from: Bonknuts on December 21, 2016, 06:39:23 AM

I forgot how much better the PCE/SGX sounds through a stereo system. So much more bass-y-er and less tinny than emulation through TV or even earphones on the laptop. And the analog filtering makes it a bit softer on the real system too. I wish emulators could emulate that.

Yes, absolutely.

Now, some games, like the venerable China Warrior, have awesome bass (in the main tune)... now, imagine how ridiculously awesome the bass is with a SUBWOOFER.

touko · « **Reply #46 on:** December 21, 2016, 08:08:53 PM »

Quote from: Bonknuts on December 21, 2016, 06:26:52 AM

Ahh ok - I think I know what Touko is talking about now. Touko, can you post your code example?

Yes.
Sorry, but the comments are in French (some English comments added).

User_Timer_Irq:
      stz    $1403            ; // Reset TIMER
      pha
      phx
      phy

   ; // Avoid disabling the voice when there is no sample
      bbs   #7 , <test_octet_voix1 , .fin_sample_voix1

      lda   #VOIX_DDA1            ; // Select DDA voice 1
      sta   $800

      bbs   #0 , <test_octet_voix1 , .prep_octets_voix1

      lda   <sample_base_voix1
      cmp   <sample_taille_voix1
      bcc   .fin_comp1
      lda   <sample_base_voix1 + 1
      cmp   <sample_taille_voix1 + 1   ; // If end of sample
      bcc   .fin_comp1

      stz   $804               ; // Sound to 0 on voice 1
      smb   #7 , <test_octet_voix1   ; // Disable sample playback for voice 1
      bra   .fin_sample_voix1

   ; // Cache 3 samples (2 bytes) for voice 1 and read the first sample
     .fin_comp1:

     ; Map the sample data
      tma #3
      tax
     tma #4
      tay

      lda <sample_bank_voix1
      tam #3
      inc A
      tam #4

     ; Buffering the 2 bytes
      lda [ sample_base_voix1 ]
      sta   <cache_memory_voix1

      inc   <sample_base_voix1
      lda   [ sample_base_voix1 ]
      sta   <cache_memory_voix1 + 1

; Restoring the banks context
      txa
      tam   #3
      tya
      tam    #4

      lda   #3
      sta    <test_octet_voix1

      lda   <cache_memory_voix1 ; Reading the first sample

      inc   <sample_base_voix1
      bne   .transfert_data_sample_voix1
      inc   <sample_base_voix1 + 1

      bra   .transfert_data_sample_voix1

   ; // Read the second sample
     .prep_octets_voix1:
      lda   <cache_memory_voix1 + 1 ; Reading the second sample

   ; // If the second sample has already been read, decompress sample 3
      bbs   #1 , <test_octet_voix1 , .octet_suiv_voix1 ; Falls through if the second sample was already sent
; Then decompress the third sample
      and    #$60
      lsr    A
      lsr    A
      sta   <cache_memory_voix1 + 1
      lda    <cache_memory_voix1
      lsr    A
      lsr    A
      lsr    A
      lsr    A
      lsr    A
      ora   <cache_memory_voix1 + 1

   ; // Shift the sample counter for voice 1
     .octet_suiv_voix1:
      lsr   <test_octet_voix1   ; Right shift of the sample to send

     .transfert_data_sample_voix1:
      sta $806

     .fin_sample_voix1:
      ply
      plx
      pla

      rti

There is 1 voice here, and I only use a 2-byte buffer (3 samples, because the bits are packed).
After banking, I send the first sample; on the next interrupt, the second; and on the one after that, I unpack and send the third sample.
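
The three-step playback described above can be modelled off-target in a few lines of Python. This is only a sketch: the bit positions are inferred from the `and #$60` / `lsr` sequence in the routine, and the `& 0x1F` mask assumes the DAC register only keeps the low 5 bits of whatever is written. `unpack_touko` is a made-up name for illustration.

```python
def unpack_touko(packed: bytes) -> list[int]:
    """Decode 2-byte packets into three 5-bit samples each.

    Assumed layout (inferred from the shifts in the routine):
      byte 0: low 3 bits of sample 3 in bits 7-5, sample 1 in bits 4-0
      byte 1: high 2 bits of sample 3 in bits 6-5, sample 2 in bits 4-0
    """
    samples = []
    for i in range(0, len(packed) - 1, 2):
        byte0, byte1 = packed[i], packed[i + 1]
        samples.append(byte0 & 0x1F)      # first interrupt: byte 0 as-is
        samples.append(byte1 & 0x1F)      # second interrupt: byte 1 as-is
        high = (byte1 & 0x60) >> 2        # and #$60, then two lsr
        low = byte0 >> 5                  # five lsr
        samples.append(high | low)        # third interrupt: rebuilt sample
    return samples
```

Only the third sample costs any bit manipulation, which is where the speed of the scheme comes from.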

elmer · « **Reply #47 on:** December 22, 2016, 05:29:11 AM »

Quote from: touko on December 21, 2016, 08:08:53 PM

Sorry, but the comments are in French (some English comments added).

Your English is far better than my French!

I'm sure that we can manage with a bit of help from Google Translate.

Thanks!

*************************

OK, so now I'm outputting some real music data to the PSG, and it's obvious that the 5-bit volume in PSG register 4 is not linear ... it drops off very, very quickly.

Does anyone have a calibrated linear-volume to PSG-volume lookup table to share?

I can earball it and come up with a rough approximation ... but I don't have an oscilloscope to create a proper one.

Gredler · « **Reply #48 on:** December 22, 2016, 05:36:53 AM »

Quote from: elmer on December 22, 2016, 05:29:11 AM

I can earball it

Hahah my new favorite term

Elmer is a earballer!

Bonknuts · « **Reply #49 on:** December 22, 2016, 06:10:16 AM »

Quote from: elmer on December 22, 2016, 05:29:11 AM

Quote from: touko on December 21, 2016, 08:08:53 PM
Sorry, but the comments are in French (some English comments added).

Your English is far better than my French!

I'm sure that we can manage with a bit of help from Google Translate.

Thanks!

*************************

OK, so now I'm outputting some real music data to the PSG, and it's obvious that the 5-bit volume in PSG register 4 is not linear ... it drops off very, very quickly.

Does anyone have a calibrated linear-volume to PSG-volume lookup table to share?

I can earball it and come up with a rough approximation ... but I don't have an oscilloscope to create a proper one.

For channel volume, it's a 1.5 dB drop per step. For pan, it's a 3.0 dB drop per step. For main channel volume, 0 is -infinity, but that's not true for pan! 0 for pan is not true silence.

Here's a chart I made for Amiga/XM linear to PCE: http://www.pcedev.net/blog/files/XM_volume_tables.txt
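
As a rough illustration of the 1.5 dB rule (this is not the linked chart), a table could be generated along these lines. The sketch assumes register value 31 is full scale, every step below it is 1.5 dB quieter, 0 is silence, and the source volume runs 0..64 as in XM modules. The function names are invented for the example.

```python
DB_PER_STEP = 1.5   # channel volume step size quoted above
MAX_PSG = 31        # top of the 5-bit volume range


def psg_gain(v: int) -> float:
    """Linear amplitude for PSG volume v (assumes 31 = full scale, 0 = silence)."""
    if v == 0:
        return 0.0
    return 10 ** (-(MAX_PSG - v) * DB_PER_STEP / 20)


def linear_to_psg(level: int, max_level: int = 64) -> int:
    """Pick the PSG volume whose gain is closest to level / max_level."""
    target = level / max_level
    return min(range(MAX_PSG + 1), key=lambda v: abs(psg_gain(v) - target))


# One entry per linear volume 0..64.
table = [linear_to_psg(n) for n in range(65)]
```

Half volume lands four steps below the top, since 4 x 1.5 dB is about 6 dB, which shows how quickly the register falls away from a linear scale.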

elmer · « **Reply #50 on:** December 22, 2016, 06:30:34 AM »

Quote from: Gredler on December 22, 2016, 05:36:53 AM

Hahah my new favorite term

Quote from: Bonknuts on December 22, 2016, 06:10:16 AM

For channel volume, it's a 1.5 dB drop per step. For pan, it's a 3.0 dB drop per step. For main channel volume, 0 is -infinity, but that's not true for pan! 0 for pan is not true silence.

Here's a chart I made for Amiga/XM linear to PCE: http://www.pcedev.net/blog/files/XM_volume_tables.txt

Thanks!

touko · « **Reply #51 on:** December 22, 2016, 07:14:06 AM »

Quote

Your English is far better than my French!

Easy, but thanks

Bonknuts · « **Reply #52 on:** December 22, 2016, 08:52:19 AM »

Just looking through your code touko:

stz $804 ; // Sound to 0 on voice 1
^- I wouldn't do this. If you're trying to silence the channel in DDA mode, simply don't write anything to it (also, writing the same sample value over and over in a row has the same effect as silence). Otherwise this is going to give you a pop/click on the DAC at the end of a sample, on non-SGX systems. It might not be noticeable in loud samples (that have a large amplitude pattern at the end), but ones that end with a built-in fade (ending with a soft part) will make this pop/click more noticeable.

EDIT
Ok, I understand your compression scheme now. I hadn't seen that one before. It's pretty decent. A nice trade off between size and speed.

Bonknuts · « **Reply #53 on:** December 22, 2016, 11:46:39 AM »

Ok. So I mapped out your routine, hopefully with no errors, and I came up with this:

Reload takes 168 cycles -> first sample
Second sample takes 86 cycles
Third sample takes 102 cycles

There are ~116 calls per frame. And there are three phases to each sample: the lengths shown above. Each phase is a complete path from interrupt call, to output sample, to exit routine.

Each phase gets a third of the calls: 116 / 3 = 38.667. So 6,496 + 3,325 + 3,944 = 13,765 cycles per frame,
or 11.5% CPU resource. That's not bad.
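
The arithmetic can be checked with a short Python sketch. The clock (7,159,090 Hz) and the 60 frames per second are assumptions made here to turn cycles into a percentage; they are not stated in the post, but they reproduce the 11.5% figure.

```python
CPU_HZ = 7_159_090        # assumed CPU clock
FRAMES_PER_SEC = 60       # assumed nominal frame rate
CYCLES_PER_FRAME = CPU_HZ / FRAMES_PER_SEC


def frame_cost(phase_cycles, calls_per_frame=116):
    """Total cycles per frame when every phase gets an equal share of the calls."""
    calls_per_phase = calls_per_frame / len(phase_cycles)
    return sum(c * calls_per_phase for c in phase_cycles)


cycles = frame_cost([168, 86, 102])          # reload, second, third sample
percent = 100 * cycles / CYCLES_PER_FRAME
```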

There's a different compression scheme that's a little tighter: it's 5 bytes long and holds 8 samples. The first byte holds the MSbit of each of the 8 5-bit samples. The next 4 bytes are pairs of 4-bit samples (the low bits).

Your scheme does 3 samples for every 2 bytes, with 1 bit thrown away, which is a division of 1.5. So 7,000 samples, or 1 second, is 4,666 bytes in length. The above compression scheme is 8 samples for 5 bytes, which is a division of 1.6. So 7,000 samples, or 1 second, is 4,375 bytes. It saves about 291 bytes a second over the 3/2 method.
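
A sketch of the 8-in-5 scheme in Python, for comparison. The post only says which bits go in which bytes, so the ordering below (bit i of the first byte belongs to sample i, even samples in the low nibble) is an assumption, and the function names are invented for the example.

```python
def pack8(samples):
    """Pack eight 5-bit samples into 5 bytes: one byte of MSBs, four of nibble pairs."""
    if len(samples) != 8 or any(not 0 <= s < 32 for s in samples):
        raise ValueError("need exactly eight 5-bit samples")
    msbs = 0
    for i, s in enumerate(samples):
        msbs |= (s >> 4) << i                 # assumed: bit i holds sample i's MSB
    out = [msbs]
    for i in range(0, 8, 2):
        out.append((samples[i] & 0x0F) | ((samples[i + 1] & 0x0F) << 4))
    return bytes(out)


def unpack8(block):
    """Inverse of pack8."""
    msbs = block[0]
    samples = []
    for i in range(8):
        pair = block[1 + i // 2]
        low = (pair >> 4) if i % 2 else (pair & 0x0F)
        samples.append((((msbs >> i) & 1) << 4) | low)
    return samples


def bytes_per_second(samples_per_block, bytes_per_block, rate=7000):
    """Storage for one second of audio at the given sample rate (rounded down)."""
    return rate * bytes_per_block // samples_per_block
```

With these helpers, `bytes_per_second(3, 2)` and `bytes_per_second(8, 5)` give the two sizes quoted above.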

Is it faster or slower? I rewrote your decompression routine from scratch, using the same idea except that I used a buffer of six samples instead of 3, and the total resource was 9.4%. Not really much lower than yours. So, just to be clear - I didn't use any long double buffer system. Just buffered 6 samples instead of 3, every 6th call.

For the alt compression scheme, 8 samples to 5 bytes, I did a sample buffer of 8 samples. So every 8th call in the TIMER would have to refill it. That ended up being 9.6% cpu resource. So 0.2% slower, but a slightly better compression ratio.

If I have time, I'll see how a double buffer system stacks up to those numbers.

elmer · « **Reply #54 on:** December 22, 2016, 04:27:13 PM »

Quote from: Bonknuts on December 22, 2016, 11:46:39 AM

Ok. So I mapped out your routine, hopefully with no errors, and I came up with this:

Reload takes 168 cycles -> first sample
Second sample takes 86 cycles
Third sample takes 102 cycles

There are ~116 calls per frame. And there are three phases to each sample: the lengths shown above. Each phase is a complete path from interrupt call, to output sample, to exit routine.

Each phase gets a third of the calls: 116 / 3 = 38.667. So 6,496 + 3,325 + 3,944 = 13,765 cycles per frame,
or 11.5% CPU resource. That's not bad.

OK, I've had time to look at the problem now.

First, it's just my personal opinion, but while the 8x1+8x4 (5 byte) scheme that bonknuts mentioned is very clever, it's not worth the tiny 1/16 space-savings, and the complication that it adds, particularly when dealing with the end-of-sample condition.

From there, it's back to Touko's code ...

Mapping stuff into TAM #4 isn't needed, because Touko is already relying on his sample data being aligned on an even boundary, so I removed it, and got rid of the X and Y registers which were wasting cycles.
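
The alignment argument can be checked with a small Python model (not PCE code). It assumes the sample is read through the MPR3 window at $6000-$7FFF, that the pointer always advances by one 2-byte packet, and that the next bank is selected when the pointer reaches the end of the window. `advance` and `packet_stays_in_window` are names invented for the sketch.

```python
WINDOW_START = 0x6000   # first address of the MPR3 window
WINDOW_END = 0x8000     # first address past the 8 KB window


def packet_stays_in_window(addr):
    """True if both bytes of a packet starting at addr are inside the window."""
    return WINDOW_START <= addr and addr + 1 < WINDOW_END


def advance(bank, addr):
    """Step past one 2-byte packet, moving to the next bank at the window end."""
    addr += 2
    if addr == WINDOW_END:
        return bank + 1, WINDOW_START
    return bank, addr
```

Every even address in the window passes the first check, so the second byte never needs the following bank mapped in as well. A packet starting at the odd address $7FFF would fail it.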

Then I rearranged the data format a bit to pack the current-state flags into the sample cache for a bit more speed, and used an end-of-sample marker in bit 15 of the sample word, which could also be used for sample-looping ...

... and the result is a 30% speed improvement where the banking takes an insignificant amount of the overall time.

; Three 5-bit Samples in 2-bytes (packet located on an even address boundary)
;
; 2-Byte Packed Data Format Packet + 0  : E332 2222 (sample 3 hi-bits)
; 2-Byte Packed Data Format Packet + 1  : 3331 1111 (sample 3 lo-bits)
;
; E = 1 if end-of-sample.
;
; Timing for sample 1 (122 cycles) if page-overflow
; Timing for sample 1 (110 cycles) if normal
; Timing for sample 2 ( 67 cycles)
; Timing for sample 3 ( 73 cycles)
;
; Time (normal 1 channel): 122 *   1 calls +
;                          110 *  38 calls +
;                           67 *  39 calls +
;                           73 *  39 calls = 9762 cycles (8.2%)

Here's the code, but be warned ... it takes some concentration to follow the data flow and see that it should work ...

Code:

; ****************************************************************************
; ****************************************************************************
;
; Three 5-bit Samples in 2-bytes (packet located on an even address boundary)
;
; 2-Byte Packed Data Format Packet + 0  : E332 2222 (sample 3 hi-bits)
; 2-Byte Packed Data Format Packet + 1  : 3331 1111 (sample 3 lo-bits)
;
; E = 1 if end-of-sample.
;
; Timing for sample 1 (122 cycles) if page-overflow
; Timing for sample 1 (110 cycles) if normal
; Timing for sample 2 ( 67 cycles)
; Timing for sample 3 ( 73 cycles)
;
; Time (normal 1 channel): 122 *   1 calls +
;                          110 *  38 calls +
;                           67 *  39 calls +
;                           73 *  39 calls = 9762 cycles (8.2%)
;
; Maximum hsync delay:     122 cycles
;

User_Timer_Irq:                                 ; 8
        stz     $1403                           ; 5 Reset TIMER
        pha                                     ; 3

        lda     #VOIX_DDA1                      ; 2 Select DDA voice 1
        sta     $800                            ; 5

;       stz     $804                            ; - Sound to 0 on voice 1

        lda     <cache_memory_voix1 + 0         ; 4 Bit 7 is set if we need to
        bpl     .byte2                          ; 2 refill the cache.

.byte1: tma3                                    ; 4
        pha                                     ; 3

        lda     <sample_bank_voix1              ; 4
        tam3                                    ; 5

        lda     [ sample_base_voix1 ]           ; 7 Read byte 0 of packed data.
        sta     <cache_memory_voix1 + 0         ; 4
        bmi     .end                            ; 2 Test "E" bit for end-of-sample.

        inc     <sample_base_voix1              ; 6
        lda     [ sample_base_voix1 ]           ; 7 Read byte 1 of packed data.
        sta     $806                            ; 5 Write sample #1.
        lsr     a                               ; 2 Clr bit 7 of byte 1, A=%0333xxxx.
        sta     <cache_memory_voix1 + 1         ; 4

        inc     <sample_base_voix1              ; 6 Deal with overflow to next page.
        beq     .page                           ; 2

.end:   pla                                     ; 4
        tam3                                    ; 5
        pla                                     ; 4
        rti                                     ; 7

.byte2: lda     <cache_memory_voix1 + 1         ; 4 Test bit 7 of byte 1.
        bmi     .byte3                          ; 2

        lsr     a                               ; 2
        sec                                     ; 2
        ror     a                               ; 2
        sta     <cache_memory_voix1 + 1         ; 4 Set bit 7 of byte 1, A=%100333xx.

        lda     <cache_memory_voix1 + 0         ; 4 Write sample #2.
        sta     $806                            ; 5

        pla                                     ; 4
        rti                                     ; 7

.byte3: lda     <cache_memory_voix1 + 0         ; 4 A=%033xxxxx
        and     #$60                            ; 2 A=%03300000
        ora     <cache_memory_voix1 + 1         ; 4 A=%133333xx
        sta     <cache_memory_voix1 + 0         ; 4 Set bit 7 to reload data next time.
        lsr     a                               ; 2
        lsr     a                               ; 2
        sta     $806                            ; 5

        pla                                     ; 4
        rti                                     ; 7

.page:  inc     <sample_base_voix1 + 1          ; 6 Deal with bank overflow.
        bpl     .end                            ; 4

.bank:  inc     <sample_bank_voix1              ; 6
        lda     #$60                            ; 2
        sta     <sample_base_voix1 + 1          ; 4

        pla                                     ; 4
        tam3                                    ; 5
        pla                                     ; 4
        rti                                     ; 7
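
For anyone who wants to follow the data flow without an assembler, here is a rough Python model of the routine above. It is a sketch under assumptions: banking and the page/bank overflow path are left out, the DAC is assumed to keep only the low 5 bits of each write, and `pack_packet` and `Voice` are names invented for the example.

```python
def pack_packet(s1, s2, s3, end=False):
    """Build one packet: byte 0 = E332 2222, byte 1 = 3331 1111."""
    byte0 = (0x80 if end else 0) | ((s3 & 0x18) << 2) | s2
    byte1 = ((s3 & 0x07) << 5) | s1
    return bytes([byte0, byte1])


class Voice:
    def __init__(self, data):
        self.data = data
        self.pos = 0
        self.cache0 = 0x80      # bit 7 set: refill on the first interrupt
        self.cache1 = 0
        self.dac = []           # values written to the DAC, masked to 5 bits

    def irq(self):
        if self.cache0 & 0x80:                        # .byte1: refill the cache
            self.cache0 = self.data[self.pos]
            if self.cache0 & 0x80:                    # "E" bit: stay parked here
                return
            a = self.data[self.pos + 1]
            self.dac.append(a & 0x1F)                 # sample 1
            self.cache1 = a >> 1                      # %0333xxxx, bit 7 clear
            self.pos += 2
        elif not self.cache1 & 0x80:                  # .byte2
            self.cache1 = 0x80 | (self.cache1 >> 2)   # lsr / sec / ror
            self.dac.append(self.cache0 & 0x1F)       # sample 2
        else:                                         # .byte3
            a = (self.cache0 & 0x60) | self.cache1    # %133333xx
            self.cache0 = a                           # bit 7 set: refill next time
            self.dac.append((a >> 2) & 0x1F)          # sample 3
```

Feeding it a few packets followed by an end marker shows the two cache bytes doing double duty as sample storage and as the state flags, and shows the routine idling on the end marker once the data runs out.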

touko · « **Reply #55 on:** December 22, 2016, 08:03:44 PM »

Quote

I wouldn't do this. If you're trying to silence the channel in DDA mode, simply don't write anything to it (also, writing the same sample value over and over in a row has the same effect as silence). Otherwise this is going to give you a pop/click on the DAC at the end of a sample, on non-SGX systems. It might not be noticeable in loud samples (that have a large amplitude pattern at the end), but ones that end with a built-in fade (ending with a soft part) will make this pop/click more noticeable.

Yes, I know, but I have not tested on a real PCE, only on an SGX, and I'm hesitating over muting with the balance instead (I don't know if the pop is present with balance).

Quote

Ok, I understand your compression scheme now. I hadn't seen that one before. It's pretty decent. A nice trade off between size and speed.

Oh, I thought it was obvious to do it like that ..

It's organised like this:

Classic byte organisation, you lose 3 bits per sample:
AAAAA000 ; sample 1 => byte 1
BBBBB000 ; sample 2 => byte 2
CCCCC000 ; sample 3 => byte 3

Packed, you lose 1 bit for 3 samples:
AAAAACCC ; byte 1
BBBBBCC0 ; byte 2

Samples A and B are sent without any processing; only sample C has to be shifted back into a correct 5-bit sample.
It's fast (30 cycles) with a compression of 33%.
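
A possible encoder for this format, sketched in Python. The diagram above shows which bits share a byte; the exact bit positions used here are an assumption taken from the shifts in the playback routine (samples A and B in the low five bits so they can be written out untouched). `pack_touko` is a made-up name.

```python
def pack_touko(samples):
    """Pack 5-bit samples three at a time into 2 bytes (one bit per packet unused)."""
    if len(samples) % 3:
        raise ValueError("sample count must be a multiple of 3")
    if any(not 0 <= s < 32 for s in samples):
        raise ValueError("samples must be 5-bit values")
    out = bytearray()
    for i in range(0, len(samples), 3):
        a, b, c = samples[i:i + 3]
        out.append(((c & 0x07) << 5) | a)   # low 3 bits of C above sample A
        out.append(((c & 0x18) << 2) | b)   # high 2 bits of C above sample B, bit 7 free
    return bytes(out)
```

Thirty samples stored one per byte shrink to twenty bytes, which is the 33% saving mentioned above.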

Quote

Mapping stuff into TAM #4 isn't needed, because Touko is already relying on his sample data being aligned on an even boundary, so I removed it, and got rid of the X and Y registers which were wasting cycles.

Hmm, I see the trick, but it makes it more complicated to tell whether you have reached the end of your sample.

Quote

I didn't use any long double buffer system. Just buffered 6 samples instead of 3, every 6th call.

Of course, a long buffer is not my goal either and is out of the question; a 6-byte buffer can be decent, but no more.

Quote

Then I rearranged the data format a bit to pack the current-state flags into the sample cache for a bit more speed, and used an end-of-sample marker in bit 15 of the sample word, which could also be used for sample-looping ...

Clever, and the way you deal with bank overflow is clever too.

Quote

Timing for sample 1 (122 cycles) if page-overflow
; Timing for sample 1 (110 cycles) if normal
; Timing for sample 2 ( 67 cycles)
; Timing for sample 3 ( 73 cycles)

Bonknuts' results and yours show what I said:

A small buffer reduces the need for mapping and, with it, the average number of CPU cycles.

Quote

; Time (normal 1 channel): 122 * 1 calls +
; 110 * 38 calls +
; 67 * 39 calls +
; 73 * 39 calls = 9762 cycles (8.2%)

8.2% is really good, and there are some good ideas here to improve sample playing.

EDIT: I modified my PCM routine a little bit. Now the sample ends when bit 7 of the second byte is 1, the mapping is only on MPR3, and the bank is remapped when a page overflow occurs.
Thanks elmer, it's much faster now ;-)

User_Timer_Irq:
      stz    $1403                  ; // Reset TIMER
      pha

   ; // Avoid disabling the voice when there is no sample
      bbs      #7 , <test_octet_voix1 , .fin_sample_voix1

      lda      #VOIX_DDA1            ; // Select DDA voice 1
      sta      $800

      bbs      #0 , <test_octet_voix1 , .prep_octets_voix1

   ; // Cache 3 samples (2 bytes) for voice 1 and read the first sample
     .fin_comp1:
      tma #3
      pha

      lda <sample_bank_voix1
      tam #3

      lda [ sample_base_voix1 ]
      sta      <cache_memory_voix1

      sta      $806            ; // Play sample 1 on voice 1

      inc      <sample_base_voix1
      lda      [ sample_base_voix1 ]
      bmi      .voix_1_off      ; // If bit 7 = 1, end of sample: exit

      sta      <cache_memory_voix1 + 1
   ; // Restore the previous bank
      pla
      tam    #3

      lda      #3
      sta    <test_octet_voix1

      inc      <sample_base_voix1
      bne      .fin_sample_voix1
      inc      <sample_base_voix1 + 1
      bpl      .fin_sample_voix1

      lda      #$60
      sta      <sample_base_voix1 + 1
      inc      <sample_bank_voix1

      bra      .fin_sample_voix1

     .voix_1_off:
   ; // Restore the previous bank
      pla
      tam    #3

      stz      $804               ; // Sound to 0 on voice 1
      smb      #7 , <test_octet_voix1 ; // Disable sample playback for voice 1

      bra      .fin_sample_voix1

   ; // Read the second sample
     .prep_octets_voix1:
      lda      <cache_memory_voix1 + 1

   ; // Shift the sample counter for voice 1
      lsr      <test_octet_voix1
   ; // If the second sample has already been read, decompress sample 3
      bne      .transfert_data_sample_voix1
      and    #$60
      lsr    A
      lsr    A
      sta      <cache_memory_voix1 + 1
      lda    <cache_memory_voix1
      lsr    A
      lsr    A
      lsr    A
      lsr    A
      lsr    A
      ora      <cache_memory_voix1 + 1

     .transfert_data_sample_voix1:
      sta    $806

     .fin_sample_voix1:
      pla
      rti

Bonknuts · « **Reply #56 on:** December 23, 2016, 03:38:23 AM »

Quote from: elmer on December 22, 2016, 04:27:13 PM

OK, I've had time to look at the problem now.

First, it's just my personal opinion, but while the 8x1+8x4 (5 byte) scheme that bonknuts mentioned is very clever, it's not worth the tiny 1/16 space-savings, and the complication that it adds, particularly when dealing with the end-of-sample condition.

I found a few HuCard games doing it, specifically Street Fighter 2 (which does it for two sample channels). It's faster than the normal bit-packed samples that other PCE games tend to use. I've never seen the scheme touko used.

End of sample is convenient if you have the free bit for it, but it's not necessary. But it is awesome that you fitted it in there. I still like the padded end-of-sample block method (whatever that block is; 116 samples is the block in this case), where you check in the vsync interrupt.
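
One way the padded-block idea could look on the tool side, as a Python sketch. The post does not say what the padding contains; repeating the final value is an assumption here, chosen because a held value is silent on the DAC. The helper names are invented for the example.

```python
BLOCK = 116   # samples consumed per frame, the block size mentioned above


def pad_to_block(samples, block=BLOCK):
    """Repeat the final value so the length is a whole number of blocks."""
    if not samples:
        return []
    shortfall = -len(samples) % block
    return list(samples) + [samples[-1]] * shortfall


def frames_to_play(samples, block=BLOCK):
    """Number of vsync ticks before the player can be switched off."""
    return len(pad_to_block(samples, block)) // block
```

The player then only needs a per-frame countdown in the vsync handler, and the timer interrupt itself never tests for the end of the sample.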

I honestly thought the SF2 method would be faster. It looks simpler IMO (less steps/shifts).

Quote

Mapping stuff into TAM #4 isn't needed, because Touko is already relying on his sample data being aligned on an even boundary, so I removed it, and got rid of the X and Y registers which were wasting cycles.

I didn't catch that he was doing that. I shaved the 6-sample buffer method down to 8.9%, but the gain from the bigger buffer isn't enough to make up for its overhead (keeping a pointer into the playback buffer) and beat the pointer-free method you did with a 3-sample cache.

Quote

Here's the code, but be warned ... it takes some concentration to follow the data flow and see that it should work ...

A block comment in assembly! I've never seen anyone do that - haha. Your code is clean, clear, and easy to understand. And fast. I'm impressed.

Bonknuts · « **Reply #57 on:** December 23, 2016, 03:56:24 AM »

Touko: Do you need compression? Or are you just using compression on the samples because it's a waste not to? In other words, you expressed interest in the soft mix player I made - the one with 4 channels on two PCE channels. That wouldn't use any compression, because the sample depth is higher than 5-bit. Would that mean you're not interested in it?

elmer · « **Reply #58 on:** December 23, 2016, 05:46:45 AM »

Quote from: touko on December 22, 2016, 08:03:44 PM

Thanks elmer, it's much faster now ;-)

I'm glad to have helped!

I calculated the timings for your code vs mine and it comes out as ...

; Timing for sample 1 (130 cycles) if page-overflow
; Timing for sample 1 (122 cycles) if normal
; Timing for sample 2 ( 67 cycles)
; Timing for sample 3 ( 93 cycles)
;
; Timing for finished ( 35 cycles)
;
; Time (normal 1 channel): 130 *   1 calls +
;                          122 *  38 calls +
;                           67 *  39 calls +
;                           93 *  39 calls = 11006 cycles (9.2%)

So you're 1244 cycles-per-frame slower than me, but your code is a bit clearer to follow.

Quote from: touko on December 22, 2016, 08:03:44 PM

Hmm, I see the trick, but it makes it more complicated to tell whether you have reached the end of your sample.

My point was that you don't need to waste 6 cycles by doing a test for sample-still-playing at the start of every one of the 116/117 interrupts ... it wastes 696/702 cycles-per-frame during normal playback.
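
The per-frame totals and the cost of that test can be reproduced with a few lines of Python. The sketch simply encodes the call schedule from the timing comments (one page-overflow refill, 38 normal refills, 39 second samples and 39 third samples per frame); the function name is made up.

```python
def frame_cycles(first_overflow, first_normal, second, third):
    """Cycles per frame for the 1 + 38 + 39 + 39 call schedule."""
    return first_overflow * 1 + first_normal * 38 + second * 39 + third * 39


elmer = frame_cycles(122, 110, 67, 73)
touko = frame_cycles(130, 122, 67, 93)
difference = touko - elmer

# Cost of a 6-cycle "still playing?" test on every interrupt of a frame.
check_overhead = [6 * calls for calls in (116, 117)]
```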

It's easy to change my code to set a flag that your main loop can test ...

        ...
        lda     [ sample_base_voix1 ]           ; 7 Read byte 0 of packed data.
        sta     <cache_memory_voix1 + 0         ; 4
        bmi     .finish                         ; 2 Test "E" bit for end-of-sample.
        ...

.finish:stz     <sample_playing                 ; 4
        pla                                     ; 4
        tam3                                    ; 5
        pla                                     ; 4
        rti                                     ; 7

While it looks like my code wastes a lot of time re-reading the end-of-sample marker ... the point is that it doesn't matter in practice!

It takes no more time to do that than it does to actually play the sample ... which you already have to allow for in your game design.

The important thing to do, however you play back the sample, is to disable the timer interrupt entirely if you're not playing back any samples at all.

Quote from: touko on December 22, 2016, 08:03:44 PM

Of course, a long buffer is not my goal either and is out of the question; a 6-byte buffer can be decent, but no more.

It would save you a few more cycles in the banking, but you may lose a lot of the time-savings in the extra code/branches to figure out which sample to output.

Bonknuts · « **Reply #59 on:** December 23, 2016, 06:20:42 AM »

Quote from: elmer on December 23, 2016, 05:46:45 AM

It takes no more time to do that than it does to actually play the sample ... which you already have to allow for in your game design.

^This. Worst case is what matters. Specifically when the occurrence is kinda random (you don't know when you're going to play a sample), any resource use between 0 and the expected ceiling doesn't matter much IMO. Only the ceiling.
