Author Topic: MML: What are people's actual complaints with the damn thing (Read 7409 times)

Arkhan · « **Reply #135 on:** December 12, 2016, 07:20:59 AM »

Is it slower on 6280 vs. 6502? 6502.org lists it as 6.

elmer · « **Reply #136 on:** December 12, 2016, 07:44:40 AM »

Quote from: Arkhan on December 12, 2016, 07:20:59 AM

Is it slower on 6280 vs. 6502? 6502.org lists it as 6.

Lots of stuff is a cycle slower on the 6280 ... but lots of stuff is the same.

I have no idea why there's an extra cycle on an RTI, but it's documented.

The timings in the cribsheets that can be downloaded from Bonknuts' blog are accurate.

Bonknuts · « **Reply #137 on:** December 12, 2016, 08:01:47 AM »

14 was the return total (pla, tam, and rti)
The int call is 8 cycles, the RTI is 7 cycles. That's the minimum you can possibly have on the PCE.

Anyway, look over this:

I changed a few things around. "in_progress" can be used either as a temporary bypass (a sample is currently being fetched/played) or to completely stop sample playback (because it can't undo itself if the condition is set outside the IRQ routine).

I also decided to wrap the timer decrement inside a protected area (interrupts disabled), just in case some wild edge case showed its ugly head. You don't want a second call (an overlapping TIRQ) to decrement the counter while the previous call is in the middle of updating it (resetting it back). The question is, would that be a bad thing if it did happen?

So for stalling H-int, the longest is 44 cycles, mid is 30 cycles, and minimum is 22 cycles. The code is a little difficult to follow at first, because I'm trying to keep the cycle count of each delay case as low as possible. You might notice that if a sample is going to be played, interrupts are enabled during that process, but disabled for the counter. If a sample isn't played, the branch that skips it leaves interrupts disabled until the counter decrement has finished.

Actually, the worst-case scenario for H-int delay would be 45 cycles: no sample to play, decrement counter but no psg player call, exit routine.

So, you want overhead? Just the overhead of the decrement counter and no sample playing or player call? 45 cycles per call (a call being TIMER set to max speed). If you re-sync TIMER on internal vsync interrupt, you get a nice integer of 116 calls per frame (and no drift of the psg player relative to the frame - async.. eww). 116*45 = 5220 cycles or 4.38% cpu resource overhead.

Note: I think you stated something like 433 cpu cycles per scanline. I'm not sure if you were talking absolute or relative to overhead: VDC scanline is 455 cpu cycles long (the frame rate of 262 mode is something like ~60.1hz).
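For anyone who wants to check the numbers above, here is a rough Python sketch of the arithmetic. It only assumes the figures quoted in this post (455 cycles per scanline, 262 lines per frame, 116 TIMER calls per frame); the constant and function names are made up for illustration.

```python
# Figures taken from the post above; names are illustrative only.
CYCLES_PER_SCANLINE = 455   # CPU cycles per VDC scanline
SCANLINES_PER_FRAME = 262   # "262 mode"
CALLS_PER_FRAME = 116       # TIMER at max speed, re-synced every vsync


def frame_cycles():
    """CPU cycles available in one video frame."""
    return CYCLES_PER_SCANLINE * SCANLINES_PER_FRAME


def overhead(cycles_per_call, calls=CALLS_PER_FRAME):
    """Return (cycles per frame, percent of the frame) for a handler."""
    total = cycles_per_call * calls
    return total, 100.0 * total / frame_cycles()


total, pct = overhead(45)
print(frame_cycles(), total, round(pct, 2))
```

Plugging in 45 cycles per call reproduces the 5220 cycles and roughly 4.38% quoted above, against a 119210-cycle frame.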

Quote

Is it slower on 6280 vs. 6502? 6502.org lists it as 6.

Some things are +1 cycle longer on the PCE (but unlike the 65x, there are no page boundary penalties.)
http://www.pcedev.net/blog/files/Otaku_no_PCE_cribsheet_page2_0_1_4.png

EDIT
Oops. I forgot to update ".EOF". It should read like this:
pla
tam #$04
jmp .TIMER_sampledisable

Where .TIMER_sampledisable is a label point right after stz <in_progress.

ccovell · « **Reply #138 on:** December 12, 2016, 11:25:03 AM »

Slightly apropos of this topic, MML, case-sensitivity, and Z80 stuff, etc., did you hear about the MML driver that's in Alpha's Neo-Geo games? I did a dissection, explanation, videos, etc. here: http://www.chrismcovell.com/ADKMML.html

Who knows, maybe the overloaded MML syntax might give some ideas for customization for the PCE hardware.

elmer · « **Reply #139 on:** December 12, 2016, 01:32:08 PM »

Quote from: Bonknuts on December 12, 2016, 08:01:47 AM

The int call is 8 cycles, the RTI is 7 cycles. That's the minimum you can possibly have on the PCE.

Quote

Note: I think you stated something like 433 cpu cycles per scanline. I'm not sure if you were talking absolute or relative to overhead: VDC scanline is 455 cpu cycles long (the frame rate of 262 mode is something like ~60.1hz).

Thanks, those are both really useful information!

So you're basing your percentages on 119210 cycles-per-frame?

Quote from: Bonknuts on December 12, 2016, 08:01:47 AM

Anyway, look over this:

That looks good!

I'd be tempted to do things a *little* differently if I were going to try to implement the permanent-timer and music-driver-running-from-the-timer, especially in the case of HuC.

From what I've seen, HuC maps the main library bank to both MPR6 and MPR7, probably in order to maintain a consistent memory layout between HuCard and CD-ROM, and have the library code at $C000.

In that case (and in whatever case if I were writing purely in assembly), I'd probably prefer to map RAM into MPR7 so that I could change the interrupt vectors at runtime.

If you do that, then your timer routine depends upon how many channels you want to play, and your idle-case for the driver gets a lot faster, something like this ...

tirq_none:      ;;;                             ; 8 (cycles for the INT)
                stz     $1403                   ; 5 Acknowledge TIMER IRQ.
                dec     <driver_cnt             ; 6
                beq     .driver_s               ; 2
                rti                             ; 7

That's just 28 cycles-per-irq, and 3248 cycles-per-frame, for just a 2.7% CPU overhead.
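As a quick cross-check of that idle path, here is a small Python sketch that adds up the per-instruction cycle counts from the listing above. The cycle values are copied from the comments in the listing; the 119210-cycle frame is the 455 x 262 figure discussed earlier, and the variable names are just for illustration.

```python
# Cycle counts copied from the tirq_none listing above (branch not taken).
TIRQ_NONE_IDLE = [
    ("interrupt entry", 8),
    ("stz $1403", 5),
    ("dec <driver_cnt", 6),
    ("beq .driver_s (not taken)", 2),
    ("rti", 7),
]

FRAME_CYCLES = 455 * 262   # cycles per frame, from the earlier posts
CALLS_PER_FRAME = 116      # timer interrupts per frame


def path_cycles(path):
    """Sum the cycle column of a list of (instruction, cycles) pairs."""
    return sum(cycles for _, cycles in path)


per_irq = path_cycles(TIRQ_NONE_IDLE)
per_frame = per_irq * CALLS_PER_FRAME
print(per_irq, per_frame, round(100.0 * per_frame / FRAME_CYCLES, 2))
```

This gives 28 cycles per IRQ and 3248 cycles per frame, which is about 2.7% of the frame.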

There is one thing ... I don't think that your sample-playback code is safe from something in a vsync or hsync handler changing the value of sample_bank or sample_ptr.

If you run the music-driver from the timer interrupt, then you'll probably get away with that, but you may start getting problems if the music-driver runs in the vsync.

Of course ... the big thing here is that we're both still messing-around and trying to come up with solutions that work when the music-driver is called more-often than 60Hz.

I'd really rather not do that.

Bonknuts · « **Reply #140 on:** December 12, 2016, 02:30:00 PM »

Quote from: elmer on December 12, 2016, 01:32:08 PM

That's just 28 cycles-per-irq, and 3248 cycles-per-frame, for just a 2.7% CPU overhead.

It's just a template to work with. Nice optimization though

If you're going to be playing samples for music and SFX, personally I'd rather optimize for the worst-case scenario. If the max is 11% cpu resource, that's what I'm going to figure in when I'm allocating cpu resource per frame. Anything that fluctuates below that, such as idle cost, isn't going to help me. You have something else in mind?

Quote

There is one thing ... I don't think that your sample-playback code is safe from something in a vsync or hsync handler changing the value of sample_bank or sample_ptr.

If you run the music-driver from the timer interrupt, then you'll probably get away with that, but you may start getting problems if the music-driver runs in the vsync.

You're totally on your game

I didn't put it in the code because we hadn't talked that far yet: There are some questions that need to be answered first: is the PSG player going to issue/handle SFX? I.e. SFX is handled through the PSG player or outside of it. I'd assume sampled instruments would be handled inside of it.

I ran into this exact problem with my 4 channel PCM driver and how to change samples without overriding something that's in the middle of the TIRQ routine. I came up with a processing system. You send requests for driver updates (sample stuff), and the TIMER routine itself called this little "processing routine" to safely update sample regs - every 116 down count (another reason to re-sync TIMER immediately in vsync int).

But yeah, there needs to be a window provided to safely update the sample reg stuff of the TIMER routine. If it's done through the PSG player (or whatever music "engine" is called in its place), then that would solve a lot of problems. And likewise, SFX should probably go through it. But you can do a buffer/processor system with a hook in vsync to safely update it too (as long as TIMER isn't async).

I attached the "processor update" code to the TIMER itself with a downcount of 116, only because my 4 channel PCM driver is software frequency scaling each channel in realtime (and can stream samples to all 6 channels: 4 with frequency scaling, two fixed frequency for SFX). So it came dangerously close to not finishing in time in the worst-case scenario on the vsync interrupt. Since this is just a simple sample stream routine, you could easily and safely put the "processor update" code in the vsync int routine. As in, directly inside the vsync handler, not a manual call in HuC itself after a vsync().
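To make the request/processor idea a bit more concrete, here is a loose Python model of it. This is only a sketch of the general pattern described above, not the actual driver: the class, its fields, and the method names are all hypothetical, and the only detail taken from the post is that requests get applied once every 116 timer ticks.

```python
DOWNCOUNT = 116  # timer ticks between safe update windows (from the post)


class SamplePlayerModel:
    """Toy model: main code posts requests, the timer applies them safely."""

    def __init__(self):
        self.active = None      # sample the timer routine is streaming
        self.pending = None     # request posted by main code, not yet applied
        self.count = DOWNCOUNT

    def request(self, sample):
        # Main code never touches `active` directly; it only posts a request.
        self.pending = sample

    def timer_tick(self):
        # ... per-tick sample output would happen here ...
        self.count -= 1
        if self.count == 0:
            self.count = DOWNCOUNT
            # Safe window: nothing else is mid-update on the sample state.
            if self.pending is not None:
                self.active, self.pending = self.pending, None


player = SamplePlayerModel()
player.request("kick")
for _ in range(DOWNCOUNT - 1):
    player.timer_tick()
before = player.active        # still None: window not reached yet
player.timer_tick()
after = player.active         # "kick": applied on the 116th tick
```

The point of the pattern is that the only code which ever writes the live sample state is the code that runs inside the timer routine itself, at a known point in its cycle.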

Bonknuts · « **Reply #141 on:** December 12, 2016, 02:35:39 PM »

Also, I vote we call this "sample player" addition code as... "Muthaf*cking Juan Carlos". Because this shit's gonna rule!

elmer · « **Reply #142 on:** December 12, 2016, 02:51:20 PM »

Quote from: Bonknuts on December 11, 2016, 05:19:56 AM

To play a sample via writing to a single DDA channel, at 6991 calls per second, takes 11% cpu resource. It's 5% if no sample is playing (just down counting).
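The 6991 figure in that quote lines up with the usual numbers for the hardware. The sketch below assumes a CPU clock of about 7.15909 MHz and a timer that ticks once every 1024 CPU cycles; both values are assumptions stated from general knowledge of the HuC6280 rather than anything given in this thread, and the names are illustrative.

```python
CPU_HZ = 7_159_090          # assumed CPU clock (about 7.16 MHz)
TIMER_PRESCALE = 1024       # assumed CPU cycles per timer tick
FRAME_CYCLES = 455 * 262    # cycles per frame, from the earlier posts

calls_per_second = CPU_HZ / TIMER_PRESCALE
frames_per_second = CPU_HZ / FRAME_CYCLES
calls_per_frame = FRAME_CYCLES / TIMER_PRESCALE

print(int(calls_per_second), round(frames_per_second, 1), int(calls_per_frame))
```

Under those assumptions the timer fires a little over 6991 times per second, the frame rate comes out near 60.1Hz, and there are slightly more than 116 timer periods per frame, which is why re-syncing the timer in vsync gives a clean 116 calls.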

Here are the versions that I've come up with for running the music-driver in the vblank interrupt, and then allowing for either a one-channel-only, or a two-channel-capable timer IRQ handler.

The timings are ...

|-------------------------------------------------------------------------|
| handler \ playing | 0-channels     | 1-channel       | 2-channel        |
|-------------------------------------------------------------------------|
| one-channel-only  | 6380cyc, 5.35% | 10582cyc, 8.88% |                  |
| two-channel-able  | 7308cyc, 6.13% | 11484cyc, 9.63% | 15712cyc, 13.18% |
|-------------------------------------------------------------------------|

But sensibly, if you were updating the music-driver in the vsync, and weren't actually playing any samples, then you'd just disable the timer interrupt completely and have a 0-cycle, 0% cost.

It's such a minimal extra-cost to go for the two-channel-capable driver that it seems like it would be well worth-it.

13.18% worst-case cost for two channels of samples isn't stunningly great, but it doesn't seem too bad, either.
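For reference, the per-frame totals can be rebuilt from the per-call cycle counts given in the header comments of the code below. This Python sketch assumes, as those comments do, that the worst case is 115 normal calls plus one call that crosses a bank boundary; the function names are made up for illustration.

```python
FRAME_CYCLES = 455 * 262   # 119210 cycles per frame
CALLS = 116                # timer interrupts per frame


def per_frame(normal, worst=None):
    """Cycles per frame: all-normal calls, or 115 normal plus 1 worst call."""
    if worst is None:
        return normal * CALLS
    return normal * (CALLS - 1) + worst


def percent(cycles):
    return round(100.0 * cycles / FRAME_CYCLES, 2)


rows = {
    "one-channel, idle":      per_frame(55),
    "one-channel, 1 playing": per_frame(91, 117),
    "two-channel, idle":      per_frame(63),
    "two-channel, 1 playing": per_frame(99),
    "two-channel, 2 playing": per_frame(135, 187),
}
for name, cycles in rows.items():
    print(f"{name:24s} {cycles:6d} cyc  {percent(cycles):5.2f}%")
```

Note that the 10582-cycle worst case for one channel works out to 8.88%, matching the comment in the code; 8.85% is the figure for the 10556-cycle normal case.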

Here's the code, and please let me know if anyone sees any problems/mistakes ...

Code: [Select]

;****************************************************************************
;
; One Channel Sample Playback
;
; Time (normal 0 channel):  55 * 116 calls =  6380 cycles (5.35%)
; Time (normal 1 channel):  91 * 116 calls = 10556 cycles (8.85%)
; Time (worst  1 channel):  91 * 115 calls +
;                          117 *   1 calls = 10582 cycles (8.88%)

tirq_ch4:       ;;;                             ; 8 (cycles for the INT)
                stz     $1403                   ; 5 Acknowledge TIMER IRQ.
                cli                             ; 2 Allow HSYNC to interrupt.
                pha                             ; 3
                tma3                            ; 4
                pha                             ; 3
                sei                             ; 2 Disable interrupts.

.channel4:      bbr4    <sample_flag,.done      ; 6
                lda     <sample4_bnk            ; 4
                tam3                            ; 5
                lda     #4                      ; 2
                sta     PSG_R0                  ; 5
                lda     [sample4_ptr]           ; 7
                bmi     .eof4                   ; 2
                sta     PSG_R6                  ; 5
                inc     <sample4_ptr            ; 6
                beq     .msb4                   ; 2

.done:          pla                             ; 4
                tam3                            ; 5
                pla                             ; 4
                rti                             ; 7

.msb4:          inc     <sample4_ptr+1          ; 6
                bpl     .done                   ; 2
                inc     <sample4_bnk            ; 6
                lda     #$60                    ; 2
                sta     <sample4_ptr+1          ; 4
                bra     .done                   ; 4

.eof4:          stz     PSG_R4                  ; 5
                rmb4    <sample_flag            ; 7
                bra     .done                   ; 4


;****************************************************************************
;
; Two Channel Sample Playback
;
; Time (normal 0 channel):  63 * 116 calls =  7308 cycles (6.13%)
; Time (normal 1 channel):  99 * 116 calls = 11484 cycles (9.63%)
; Time (normal 2 channel): 135 * 116 calls = 15660 cycles (13.14%)
; Time (worst  2 channel): 135 * 115 calls +
;                          187 *   1 calls = 15712 cycles (13.18%)

tirq_ch34:      ;;;                             ; 8 (cycles for the INT)
                stz     $1403                   ; 5 Acknowledge TIMER IRQ.
                cli                             ; 2 Allow HSYNC to interrupt.
                pha                             ; 3
                tma3                            ; 4
                pha                             ; 3
                sei                             ; 2 Disable interrupts.

.channel3:      bbr3    <sample_flag,.channel4  ; 6
                lda     <sample3_bnk            ; 4
                tam3                            ; 5
                lda     #3                      ; 2
                sta     PSG_R0                  ; 5
                lda     [sample3_ptr]           ; 7
                bmi     .eof3                   ; 2
                sta     PSG_R6                  ; 5
                inc     <sample3_ptr            ; 6
                beq     .msb3                   ; 2

.channel4:      bbr4    <sample_flag,.done      ; 6
                lda     <sample4_bnk            ; 4
                tam3                            ; 5
                lda     #4                      ; 2
                sta     PSG_R0                  ; 5
                lda     [sample4_ptr]           ; 7
                bmi     .eof4                   ; 2
                sta     PSG_R6                  ; 5
                inc     <sample4_ptr            ; 6
                beq     .msb4                   ; 2

.done:          pla                             ; 4
                tam3                            ; 5
                pla                             ; 4
                rti                             ; 7

.msb3:          inc     <sample3_ptr+1          ; 6
                bpl     .channel4               ; 2
                inc     <sample3_bnk            ; 6
                lda     #$60                    ; 2
                sta     <sample3_ptr+1          ; 4
                bra     .channel4               ; 4

.msb4:          inc     <sample4_ptr+1          ; 6
                bpl     .done                   ; 2
                inc     <sample4_bnk            ; 6
                lda     #$60                    ; 2
                sta     <sample4_ptr+1          ; 4
                bra     .done                   ; 4

.eof3:          stz     PSG_R4                  ; 5
                rmb3    <sample_flag            ; 7
                bra     .channel4               ; 4

.eof4:          stz     PSG_R4                  ; 5
                rmb4    <sample_flag            ; 7
                bra     .done                   ; 4
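For readers who don't read 6280 assembly, here is a Python model of what the pointer-advance logic (the inc / beq .msb / bpl path) appears to do. It is a sketch based on reading the listing, assuming the sample is viewed through the MPR3 window at $6000-$7FFF (which matches the tam3 and the lda #$60 reset); the function name is hypothetical.

```python
WINDOW_START_HI = 0x60   # high byte of $6000, start of the MPR3 window


def advance(bank, ptr):
    """Step a (bank, 16-bit pointer) pair to the next sample byte."""
    lo = (ptr + 1) & 0xFF            # inc <sample_ptr
    hi = (ptr >> 8) & 0xFF
    if lo == 0:                      # beq .msb: low byte wrapped
        hi = (hi + 1) & 0xFF         # inc <sample_ptr+1
        if hi & 0x80:                # bpl not taken: ran past $7FFF
            bank = (bank + 1) & 0xFF # inc <sample_bnk
            hi = WINDOW_START_HI     # pointer back to $6000
    return bank, (hi << 8) | lo


print(advance(5, 0x6000))   # ordinary step within a page
print(advance(5, 0x60FF))   # page crossing, same bank
print(advance(5, 0x7FFF))   # end of window: next bank, pointer reset
```

The last case is the rare expensive path that accounts for the single "worst" call per frame in the timing comments.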

Bonknuts · « **Reply #143 on:** December 12, 2016, 04:01:25 PM »

Quote from: elmer on December 12, 2016, 02:51:20 PM

13.18% worst-case cost for two channels of samples isn't stunningly great, but it doesn't seem too bad, either.

Well, compare that to Air Zonk. Air Zonk is almost 30% cpu resource max (and it hits that max every time a sample is playing)! Sure, it's decompressing samples - but.. 30%! Pfft. ~13% is fine.

Yeah, once you have the overhead of everything, adding additional channels for sample streaming doesn't increase it a whole lot.

Just some thoughts:
On your two-channel one, I would add a cli, nop, sei right after .channel4. Your worst-case scenario for each channel is a 68-cycle delay for H-int, which would probably be fine, but I wouldn't push it with twice that in a worst-case scenario.

Also, by not having a busy flag system (which leaves interrupts open for the whole thing), your setup is going to be a little less friendly with code using small Txx in 32-byte segments during active display, and with worst-case scenarios in all settings (H-int and TIRQ). Just something to note. Might want to recommend or write block transfers with 16-byte or 8-byte segments with Txx.

These following ideas might not be popular for HuC, but I'll mention them anyway:

It won't save a whole lot, but if you pad samples with some 0's (assume all samples are made of 116-byte segments), you can run a segment length counter in vsync and remove the BMI to .EOF check. It only saves 2 cycles per channel, though.

Likewise, you can speed things up if you're using two or more channels: store each run of 116 samples as a 128-byte segment (bytes 117 to 127 are null - do nothing), and then align the segments to a 128-byte boundary in rom/memory. A little bit of growth (the size of 7.68khz but the playback of 7khz), but now you don't need to check bank boundary crossing in the TIRQ routine itself - AND.. all samples, regardless of their memory address, can use one Y index reg value. Incrementing to the next 128-byte segment would be done during vsync or on the last (116th) call. Stick the TIRQ routine inside of ram, and you save some more cycles with self-modifying code. What's the thing take up now? 200 bytes?
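Here is a rough idea of what the offline padding step might look like, as a Python sketch. The 116 and 128 figures come from the suggestion above; the pad value of 0 and the function name are assumptions for illustration, and a real tool would need to pick a pad byte the playback routine treats as "do nothing".

```python
SEG_DATA = 116   # sample bytes actually played per frame
SEG_SIZE = 128   # padded segment size, aligned to 128 bytes
PAD_BYTE = 0     # assumed filler value


def pack_segments(sample: bytes) -> bytes:
    """Split a sample into 116-byte runs, each padded out to 128 bytes."""
    out = bytearray()
    for start in range(0, len(sample), SEG_DATA):
        chunk = sample[start:start + SEG_DATA]
        out += chunk
        out += bytes([PAD_BYTE]) * (SEG_SIZE - len(chunk))
    return bytes(out)


packed = pack_segments(bytes(range(100)) * 3)   # 300 bytes of dummy data
print(len(packed), len(packed) // SEG_SIZE)
```

Since an 8KB bank (8192 bytes) is an exact multiple of 128, a segment that starts on a 128-byte boundary never straddles a bank, which is what lets the bank-crossing check move out of the TIRQ routine. The cost is a growth factor of 128/116, or about 10%.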

Something I'm curious about: Why channels 3 and 4? Why not channels 0 and 1? Channel 0 saves you 2 cycles. And leaving channels 4 and 5 free allows noise mode for both of those while samples are playing. 7khz is pretty good for almost all drumkit samples, but not so great for short/closed hi-hats. The noise channel is good for those. Just some thoughts.

Arkhan · « **Reply #144 on:** December 12, 2016, 04:30:04 PM »

Quote from: Bonknuts on December 12, 2016, 02:35:39 PM

Also, I vote we call this "sample player" addition code as... "Muthaf*cking Juan Carlos". Because this shit's gonna rule!

I vote we move the assembly sample playing bromance all out of the MML thread to somewhere else because it's probably scaring people away from MML and basically has dickall to do with MML, thus clogging the thread up with something else.

I think there's already a PCM thread.

http://www.pcenginefx.com/forums/index.php?topic=21695.0

Arkhan · « **Reply #145 on:** December 12, 2016, 04:37:31 PM »

Quote from: ccovell on December 12, 2016, 11:25:03 AM

Slightly apropos of this topic, MML, case-sensitivity, and Z80 stuff, etc., did you hear about the MML driver that's in Alpha's Neo-Geo games? I did a dissection, explanation, videos, etc. here: http://www.chrismcovell.com/ADKMML.html

Who knows, maybe the overloaded MML syntax might give some ideas for customization for the PCE hardware.

Lol, that damn Magician Lord sample. It's so goofy.

I'd be curious what they composed with back then. I wonder if they just typed the music in after composing it elsewhere, or what

It's good for the most part that they stayed fairly standard, but I do not see why they went with > and < meaning backwards things.

I wonder if it was on accident?

I would personally like to avoid introducing the concept of case sensitivity because that will be a recipe for disaster and annoyance, but we have definitely already added some PCE specific things to Squirrel.

elmer · « **Reply #146 on:** December 12, 2016, 04:40:14 PM »

Quote from: ccovell on December 12, 2016, 11:25:03 AM

Slightly apropos of this topic, MML, case-sensitivity, and Z80 stuff, etc., did you hear about the MML driver that's in Alpha's Neo-Geo games? I did a dissection, explanation, videos, etc. here: http://www.chrismcovell.com/ADKMML.html

That's so darned amazingly cool, I wasn't aware of that at all. Thanks for posting that!

Quote from: Arkhan on December 12, 2016, 04:30:04 PM

I vote we move this all out of the MML thread because it's probably scaring people away from MML and basically has dickall to do with MML.

Hahaha ... just don't bring this sample-sh*t into the HuC thread!!!

Quote from: Bonknuts on December 12, 2016, 02:30:00 PM

There are some questions that need to be answered first: is the PSG player going to issue/handle SFX? I.e. SFX is handled through the PSG player or outside of it. I'd assume sampled instruments would be handled inside of it.

Any sane music-driver for a console also needs to handle sound-effects.

I have no idea what modifications it would take to hack the System Card Player to play samples as either drums, or as SFX.

Honestly ... I just don't care.

I'd much-rather have an open-source replacement (with a clean copyright status) that could play samples.

From the POV of my driver that I'm converting at the moment ... "yes", both sound effect and sample support are handled.

I do not currently, and probably won't-ever, handle the case of restarting/continuing a music-channel sample that gets cut-off by a sound-effect sample.

Fixed-rate samples in music-channels are normally limited to drums and short things like that where you don't object to dropping-out a single hit/note.

But if someone cares-enough about the capability ... that's why stuff like this should be open-source these days ... so that they can go and add it themselves.

Quote

I ran into this exact problem with my 4 channel PCM driver and how to change samples without overriding something that's in the middle of the TIRQ routine. I came up with a processing system.
...
But yeah, there needs to be window provided to safely update sample reg stuff of the TIMER routine.

I think/hope that you'll see that the code that I posted above is immune to this problem, and needs no special handling. You just disable the sample playback (or interrupts) before changing the pointer to the sample. Trivial.

It should also be completely safe for use around hsync interrupts, and 32-byte Txx transfer instructions.

elmer · « **Reply #147 on:** December 12, 2016, 05:04:58 PM »

Quote from: Bonknuts on December 12, 2016, 04:01:25 PM

Well, compare that to Air Zonk. Air Zonk is almost 30% cpu resource max (and it hits that max every time a sample is playing)! Sure, it's decompressing samples - but.. 30%! Pfft. ~13% is fine.

Reply moved to the PCM thread to stop thread-bombing Arkhan!

Bonknuts · « **Reply #148 on:** December 12, 2016, 05:13:13 PM »

Quote from: elmer on December 12, 2016, 05:04:58 PM

Quote from: Bonknuts on December 12, 2016, 04:01:25 PM
Well, compare that to Air Zonk. Air Zonk is almost 30% cpu resource max (and it hits that max every time a sample is playing)! Sure, it's decompressing samples - but.. 30%! Pfft. ~13% is fine.

Reply moved to the PCM thread to stop thread-bombing Arkhan!

Since you're doing your own engine/driver/whatever.. make a new thread! I got something I wanted to share with you, now that you have a TIRQ routine.

Arkhan · « **Reply #149 on:** January 05, 2017, 05:58:42 PM »

LOL.

This sandbox MMO called Archeage apparently uses MML, and even has a little tutorial in it.

I loled pretty hard at this discovery.

Metallica, lol, wtf:
