Great! Thanks!
I tried using the cpu_arm_asm.s code from gameblabla's version on GitHub, and you might be surprised to hear that it actually runs worse than cpu.c (compiled with -O3 optimisation).
Here are the timings running on Salamander's title screen (measured per 60 frames, i.e. 1000ms of game time):
cpu.c
CPU Emulation: avg 710ms
GPU Rendering: avg 55ms
cpu_arm_asm.s (CPU_ARM_FAST_MODE defined)
CPU Emulation: avg 920ms
GPU Rendering: avg 55ms
*CPU emulation includes all the reads/writes to and from the TG-16's memory and hardware registers (but does not include any rendering)
That was my experience using cpu_arm_asm.s (at least gameblabla's version of it), as I mentioned before. I've tried to integrate yours into this port, but apparently I still need to do some work to sync the cpu_struct / irq_struct / memory_struct layout with the asm version. I haven't had much time to fix that part.
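One idea I might try so I don't have to chase that by hand: a compile-time check that the C struct offsets match whatever the asm hard-codes. This is only a rough sketch -- the field names and offset values below are made up, and the real ones would have to be read out of cpu_arm_asm.s and the emulator's headers:

```c
/* Sketch: pin the C struct layout to the offsets the hand-written asm
   expects, so a mismatch fails at compile time instead of crashing at
   runtime. Field names and offsets here are placeholders. */
#include <stddef.h>
#include "cpu.h"   /* wherever cpu_struct / irq_struct / memory_struct live */

#define ASSERT_OFFSET(type, field, expected)                                  \
  typedef char assert_##type##_##field                                        \
      [(offsetof(type, field) == (expected)) ? 1 : -1]

/* 0x00 / 0x04 / ... are placeholders, not the real asm offsets. */
ASSERT_OFFSET(cpu_struct, pc, 0x00);
ASSERT_OFFSET(cpu_struct, cycles_remaining, 0x04);
ASSERT_OFFSET(irq_struct, status, 0x00);
ASSERT_OFFSET(memory_struct, mpr, 0x00);
```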
I looked at the code again, and unless I'm missing something, you have DEBUGGER_ON defined in platform_defines.h. You should definitely take that out and test both sides again, because the debugger slows down both CPU interpreters a lot. The slowdown is even worse in the ARM asm version because it adds a pre-check for debug countdowns (which won't be set by default) and it has to save and restore more state to go in and out of the debug function. That would explain why both measurements look really bad and why the ARM one looks even worse.
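Assuming it's a plain #define (I haven't checked the 3DS port's copy of that file), commenting it out should be enough for the benchmark runs:

```c
/* platform_defines.h -- keep the debugger disabled while benchmarking,
   so neither cpu.c nor cpu_arm_asm.s runs the per-instruction debug checks */

/* #define DEBUGGER_ON */
```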
I'd also pass this on to gameblabla in case it's happening in the GCW Zero port too.
Given these timings, I don't even want to go to the software rendering part. Blitting pixels on the 3DS hardware is helluva slow.
Can you explain further what you mean by this? The part that blits is in software, and it shouldn't be that slow with a sufficiently optimized renderer. From what I've read, direct framebuffer updates on the 3DS are slow. I don't know much more than that, e.g. whether it's because of post-processing (I hear the screen layout is rotated, so maybe something is rotating it back, which is murder on a CPU that doesn't have an L2 cache to hold a framebuffer), or whether the MMU isn't configured to write efficiently to that region of memory, e.g. with write buffering and coalescing enabled.
But you should be able to upload the output to a texture and have the GPU draw it to the screen, instead of writing directly to the framebuffer.
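Something along these lines, as a rough sketch using citro3d/citro2d from the devkitPro toolchain. I'm going from memory on the exact function names and struct layouts, so double-check them against the headers, and upload_linear_to_tiled is a placeholder for whatever linear-to-tiled conversion you end up using (e.g. GX_DisplayTransfer or a software swizzle):

```c
/* Sketch: the emulator's 256x240 output goes into a 256x256 texture and
   the GPU draws it, instead of the CPU writing the framebuffer directly. */
#include <3ds.h>
#include <citro2d.h>

#define FB_WIDTH  256
#define FB_HEIGHT 240

static C3D_Tex screen_tex;
static Tex3DS_SubTexture screen_subtex =
{
  FB_WIDTH, FB_HEIGHT,                              /* visible region   */
  0.0f, 1.0f, 1.0f, 1.0f - (FB_HEIGHT / 256.0f)     /* uv (assumed layout) */
};
static C2D_Image screen_image = { &screen_tex, &screen_subtex };
static C3D_RenderTarget *top_target;

/* Placeholder: PICA200 textures are tiled (swizzled), so a plain memcpy of
   the linear framebuffer isn't enough -- convert it here. */
static void upload_linear_to_tiled(const u16 *src, C3D_Tex *dst);

void video_init(void)
{
  gfxInitDefault();
  C3D_Init(C3D_DEFAULT_CMDBUF_SIZE);
  C2D_Init(C2D_DEFAULT_MAX_OBJECTS);
  C2D_Prepare();

  top_target = C2D_CreateScreenTarget(GFX_TOP, GFX_LEFT);

  /* Texture dimensions must be powers of two; 256x256 holds 256x240. */
  C3D_TexInit(&screen_tex, 256, 256, GPU_RGB565);
  C3D_TexSetFilter(&screen_tex, GPU_LINEAR, GPU_NEAREST);
}

void video_present(const u16 *emu_framebuffer)
{
  upload_linear_to_tiled(emu_framebuffer, &screen_tex);

  C3D_FrameBegin(C3D_FRAME_SYNCDRAW);
  C2D_TargetClear(top_target, C2D_Color32(0, 0, 0, 255));
  C2D_SceneBegin(top_target);

  /* One quad, centered on the 400x240 top screen, drawn by the GPU. */
  C2D_DrawImageAt(screen_image, (400 - FB_WIDTH) / 2.0f, 0.0f, 0.0f,
                  NULL, 1.0f, 1.0f);

  C3D_FrameEnd(0);
}
```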
For comparison, in the worst case you're re-caching everything visible on the screen when the entire palette changes. At 256x240 resolution that can be up to 33x31 unique tiles for the background, and since you're not doing anything to limit the number of sprites shown per scanline, up to 64 unique 32x64 sprites. That's a total of 196544 pixels, although in practice the VRAM can only hold about 62KB of tile and pattern data, so the real limit is around 126976 pixels unless the same tiles/patterns are being used with different palettes. Still, that's a lot more than the 61440 pixels you'd need to update if you uploaded a software-rendered 256x240 framebuffer, and it'd come as one large texture upload instead of many small ones.
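Just restating that arithmetic in one place:

```c
/* Worst-case re-cache arithmetic at 256x240:
   background: up to (256/8 + 1) x (240/8 + 1) = 33x31 tiles of 8x8 px
   sprites:    up to 64 sprites at a worst case of 32x64 px each        */
enum
{
  bg_pixels      = 33 * 31 * 8 * 8,             /*  65472 */
  sprite_pixels  = 64 * 32 * 64,                /* 131072 */
  worst_case     = bg_pixels + sprite_pixels,   /* 196544 */

  /* ~62KB of VRAM tile/pattern data at 4bpp = 2 pixels per byte */
  vram_cap       = 62 * 1024 * 2,               /* 126976 */

  /* vs. uploading one software-rendered frame */
  framebuffer_px = 256 * 240                    /*  61440 */
};
```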
And when you have to cache everything you end up doing a lot of the work you'd do in a software renderer, like palette lookups.
This may not show up when benchmarking something relatively static like a title screen but it can turn into a real sudden grind, like if the game fades out the screen.