thrush: I code for the PC-Engine entirely in ASM, so if you need someone to bug with related questions, you can add me to your list. I'm also known as Tomaitheous (and usually hang out on the mednafen IRC channel, when I have time). There are some other active ASM-only coders for PCE: MooZ, Charles MacDonald, and Chris Covell, to name a few.
http://pcedev.blockos.org/ might also be a place to ask questions. It's a bit slow, but MooZ always checks the forum (he runs it).
MooZ, Charles, and I don't use the libraries included with MagicKit. We have our own (and many times just write them over again on the fly). I found the documentation for the MKit libs seriously lacking. Building your own lib is fairly easy and gives you more freedom and control over your layout, etc., among some other small advantages. If you're compiling PCEAS, use the last public source for it, the one that's included in the HuC source kit. I have my own private build with a few upgrades added in (the 8k bank boundary crossing error for code is gone, new directives for easier table making, better name support for macros, - and + local labeling, etc). MooZ does too, IIRC. I was talking with MooZ and we're supposed to make the new fork/build public. Just waiting on him. It won't be official, but it'll be our own version.
I don't own the NeoFlash, but after chatting with Chilly Willy I found out my demos wouldn't run for him because PCEAS builds them with the header by default. So yeah, -raw assembles without a header. I use that for one of my flash cards (just a DIP ROM board).
HuC stuff: Someone mentioned replacing/reworking the scroll() routine in HuC in one of these HuC-related threads. The routine is indeed part of the MKit lib and is basically just a pass-through function. I made an optimized replacement for Xavier. It eats some more memory, but it's fast and very flexible. It behaves like the HDMA setup on the SNES, or you could think of it as similar to the 'copper' on the Amiga. It works by using a main table, 8 bits per entry. Each of the 8 bits is a command telling the hsync routine what to do: change BG color #0, change X, change Y, turn on BG, turn off BG, turn on SPR, turn off SPR, etc. There's a scanline list in a separate table telling which scanline to run the next set of functions on. You can do any or all of the functions at the same time for a given scanline, and you can do this on every visible scanline. It uses the existing irqmask setup that HuC uses (which is a clone of the System Card setup; the System Card's own setup is actually used in its place when building CD projects). I think it worked out fairly nicely and gave some powerful hsync functionality to the HuC setup.
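To make the table layout a bit more concrete, here's a rough sketch in plain C of how that kind of command table can be walked. This is not the actual replacement routine (that one is 6280 ASM hooked into the hsync interrupt); every name here (CMD_*, DisplayState, HsyncList, hsync_tick, run_frame), the bit assignments, and the parameter tables are made up for illustration, and the "display" is just a struct instead of real VDC registers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit assignments: one bit per command, as in the post.
   The real routine's bit order is not documented here. */
enum {
    CMD_BG_COLOR0 = 1 << 0,  /* change BG color #0 */
    CMD_SCROLL_X  = 1 << 1,  /* change X scroll    */
    CMD_SCROLL_Y  = 1 << 2,  /* change Y scroll    */
    CMD_BG_ON     = 1 << 3,
    CMD_BG_OFF    = 1 << 4,
    CMD_SPR_ON    = 1 << 5,
    CMD_SPR_OFF   = 1 << 6
};

/* Stand-in for the video registers the real routine would write. */
typedef struct {
    uint16_t bg_color0;
    uint16_t scroll_x;
    uint16_t scroll_y;
    uint8_t  bg_enabled;
    uint8_t  spr_enabled;
} DisplayState;

/* Parallel tables: a scanline list plus a main table of command bytes.
   The parameter tables are an assumption about where the values live. */
typedef struct {
    const uint16_t *scanline;  /* line each entry fires on, ascending */
    const uint8_t  *command;   /* 8 bits per entry, one bit per command */
    const uint16_t *color;     /* read only when CMD_BG_COLOR0 is set */
    const uint16_t *x;         /* read only when CMD_SCROLL_X is set */
    const uint16_t *y;         /* read only when CMD_SCROLL_Y is set */
    size_t count;              /* number of entries */
    size_t next;               /* index of the next pending entry */
} HsyncList;

/* Called once per scanline, like an hsync handler would be. */
void hsync_tick(HsyncList *list, uint16_t line, DisplayState *st)
{
    if (list->next >= list->count || list->scanline[list->next] != line)
        return;  /* nothing scheduled for this line */

    size_t i = list->next++;
    uint8_t cmd = list->command[i];

    /* Any combination of bits may be set for the same scanline. */
    if (cmd & CMD_BG_COLOR0) st->bg_color0 = list->color[i];
    if (cmd & CMD_SCROLL_X)  st->scroll_x  = list->x[i];
    if (cmd & CMD_SCROLL_Y)  st->scroll_y  = list->y[i];
    if (cmd & CMD_BG_ON)     st->bg_enabled = 1;
    if (cmd & CMD_BG_OFF)    st->bg_enabled = 0;
    if (cmd & CMD_SPR_ON)    st->spr_enabled = 1;
    if (cmd & CMD_SPR_OFF)   st->spr_enabled = 0;
}

/* Simulates one frame: rewind the list, then tick every visible line. */
void run_frame(HsyncList *list, DisplayState *st, uint16_t visible_lines)
{
    list->next = 0;
    for (uint16_t line = 0; line < visible_lines; line++)
        hsync_tick(list, line, st);
}
```

Keeping the scanline list separate from the command bytes means the handler only has to compare against one pending scanline number per hsync, which is roughly why this kind of layout stays cheap even with a lot of entries.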