13.18% worst-case cost for two channels of samples isn't stunningly great, but it doesn't seem too bad, either.
Well, compare that to Air Zonk. Air Zonk is almost 30% cpu resource max (and it hits that max every time a sample is playing)! Sure, it's decompressing samples - but.. 30%! Pfft. ~13% is fine.
Yeah, once you have the overhead of everything, adding additional channels for sample streaming doesn't increase it a whole lot.
Just some thoughts:
On your 2nd channel one. I would add a cli, nop, sei right after .channel4. Your worst case scenario for each channel is 68 cycle delay for H-int, which would probably be fine, but I wouldn't push it with
twice that in a worse case scenario.
Also, by not having a busy flag system (have interrupts open for the whole thing)- you're setup is going to be little less friendly with code using small Txx in 32byte segments during active display - and worse case scenarios in all settings (H-int and TIRQ). Just something to note. Might want to recommend or write block transfers with 16byte or 8byte segments with Txx.
These following ideas might not be popular for HuC, but I'll mention them anyway:
It won't save a whole lot, but if you pad sample with some 0's (assume all samples are made of 116 byte segments). You can run a segment length counter in vsync, and remove the BMI to .EOF check, it only saves 2 cycles per channel though.
Likewise, you can speed up if you're using two or more channels - if you store 116 samples as 128 byte segments (bytes 117 to 127 are null - do nothing). And then align them to 128byte boundary in rom/memory. A little bit a growth (the size of 7.68khz but the playback of 7khz), but now you don't need to check bank boundary crossing in the TIRQ routine itself - AND..
all samples regardless of their memory address can all use
one Y index reg value. Incrementing to the next 128 segment would be done during vsync or on the last 116th call. Stick the TIRQ routine inside of ram, and you save some more cycles with self-modifying code. What's the thing take up now? 200 bytes?
Something I'm curious about: Why channel's 3 and 4? Why not channels 0 and 1? Channel 0 saves you 2 cycles. And leaving channels 4 and 5 free allow noise mode for both of those while samples are playing. 7khz is pretty good for almost all drumkit samples, but not so great at short/closed hi-hats. Noise channel is good for those. Just some thoughts.