and reduce it to 1 mapping per 4 samples (4 bytes).
If you switched to a buffer system, there would be no mapping at all (the buffer should be in fixed system RAM).
Doing a buffer system is faster, but it also has some requirements. It's going to require two buffers in RAM for all channels. The timer gives you 1024 cycles between interrupts - you're going to copy 4x116 bytes in 1024 cycles? Not gonna happen. Even just one channel gets too close for comfort (713 cycles via Txx).
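To put numbers on that, here's a rough sketch of the copy cost. It assumes the commonly documented timing for the HuC6280 block transfer (Txx) instructions of 17 + 6n cycles for n bytes, which is where the 713 figure comes from. The function name is just made up for the example.

```python
TIMER_PERIOD = 1024     # CPU cycles between timer interrupts (from the post)
BYTES_PER_FRAME = 116   # bytes per channel per frame (from the post)

def block_transfer_cycles(n_bytes):
    """Cycles for one Txx block transfer, assuming 17 + 6n timing."""
    return 17 + 6 * n_bytes

one_channel = block_transfer_cycles(BYTES_PER_FRAME)   # 17 + 6*116 = 713
four_channels = 4 * one_channel                        # one transfer per channel

print("one channel:  ", one_channel, "cycles")
print("four channels:", four_channels, "cycles")
print("fits in one timer period:", four_channels <= TIMER_PERIOD)
```

One channel fits with about 300 cycles to spare, four channels doesn't fit at all, which is why the copy has to land in a second buffer rather than the one being played.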
It's not just the bank mapping that the buffer system removes. There's also no MSB check on the buffer inside the TIRQ routine, though that only saves you 2 cycles per sample, per channel. You could remove the EOF marker too, and simply have all samples trail out with zeros or $0f - both work (any value works, actually). So that's another 2 cycles per sample, per channel saved.
Don't get me wrong; I use the double buffer system for my own stuff. But sometimes, when you're giving other people functionality, it's easier to keep the interface a little simpler and just eat a little overhead.
For a single channel buffer system, you'd save ~1.8% cpu overhead. For a two channel buffer, you save ~2.2%. For a four channel buffer, you save ~2.7%. It's not a whole lot. The reason is that mapping in a channel is only 9 cycles (lda <zp: tam #$nn). The larger overhead is from the tma #n:pha and pla:tam for saving and restoring the MPR. That's 16 cycles of overhead, but that overhead basically gets divided down as more channels are output inside the routine. So the biggest saving, relative to per channel savings, is in single channel use.
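A quick sketch of how that fixed cost gets divided down, using only the 16 cycle and 1024 cycle figures from above. It doesn't try to reproduce the percentages quoted above, only the way the save/restore cost is shared between channels. The helper names are hypothetical.

```python
TIMER_PERIOD = 1024     # CPU cycles between timer interrupts
MPR_SAVE_RESTORE = 16   # tma/pha + pla/tam, paid once per interrupt

def percent_of_timer(cycles):
    """Express a per-interrupt cycle cost as a percentage of the timer period."""
    return 100.0 * cycles / TIMER_PERIOD

def fixed_share_per_channel(channels):
    """Share of the one-off save/restore cost carried by each channel."""
    return MPR_SAVE_RESTORE / channels

for n in (1, 2, 4):
    share = fixed_share_per_channel(n)
    print(n, "channel(s):", share, "cycles each,",
          percent_of_timer(share), "% of the timer period")
```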
Maybe I should be more clear: if you have 4 samples to stream, you don't map them into 4 individual banks. There's no reason to. You map them in sequential order, to the same MPR reg, as you use them. That way you only need to save/restore one bank for <n> channels to stream from. My overhead savings above assume this. If they didn't, then you'd take the 1.8% and multiply that by the number of channels used to get your total savings. But that shouldn't be the case.
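As a rough comparison of the two approaches, here's a simple cost model built from the 16 cycle save/restore and 9 cycle mapping figures. The "separate MPR" case assumes every channel pays for its own full save/restore, which is my simplification for illustration and not a measured number.

```python
MPR_SAVE_RESTORE = 16   # tma/pha + pla/tam for one MPR
MAP_CHANNEL = 9         # lda <zp / tam to map one channel's bank

def shared_mpr_cycles(channels):
    """All channels mapped one after another into the same MPR."""
    return MPR_SAVE_RESTORE + MAP_CHANNEL * channels

def separate_mpr_cycles(channels):
    """Each channel gets its own MPR, saved and restored individually (assumed)."""
    return (MPR_SAVE_RESTORE + MAP_CHANNEL) * channels

for n in (1, 2, 4):
    print(n, "channel(s): shared =", shared_mpr_cycles(n),
          "cycles, separate =", separate_mpr_cycles(n), "cycles")
```

With one channel the two are identical. The gap only opens up as channels are added, since the shared version pays the 16 cycles once.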
What I do like about the buffer system, more than the slight savings, is the flexibility of it. You can support both compressed and uncompressed samples. You could also support half frequency samples (3.5kHz instead of 7kHz; some sound FX actually sound decent at this playback rate, and there are some PCE games that play back samples at both rates). You can do all kinds of stuff, and the main TIMER routine wouldn't have to know anything other than what's in the buffer.
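Here's a minimal sketch of that idea, modelled in Python rather than real 6280 code. The fill routine handles full rate and half rate sources and pads with a trailing value once the sample runs out, so the playback side only ever sees a plain buffer. The function name, signature, and pad value are all assumptions for the example.

```python
BUFFER_LEN = 116  # bytes per channel per frame (figure used earlier in the thread)

def fill_buffer(sample, pos, half_rate=False, length=BUFFER_LEN, pad=0):
    """Build one frame's buffer from `sample`, starting at index `pos`.

    Returns (buffer, new_pos). Half rate sources have each byte written
    twice, so the timer routine can play every buffer at the same rate.
    Works cleanly when `length` is even; with an odd length the final
    byte of the frame is written only once.
    """
    out = []
    while len(out) < length:
        if pos < len(sample):
            value = sample[pos]
            pos += 1
        else:
            value = pad          # sample finished: trail out with the pad value
        out.append(value)
        if half_rate and len(out) < length:
            out.append(value)    # repeat the byte to halve the source rate
    return out, pos

full, next_full = fill_buffer([1, 2, 3], 0, length=4)
half, next_half = fill_buffer([1, 2, 3], 0, half_rate=True, length=8)
print(full, next_full)
print(half, next_half)
```

A decompressor would slot in the same way: it writes decoded bytes into the buffer, and the timer routine is none the wiser.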