I experimented a bit with this emulator and MelonDS a few days ago, and as far as I can tell the Switch's CPU isn't powerful enough for DS emulation in its current state. But there are two techniques I know of to reduce the CPU load:
1. GPU acceleration. The software rasterizer might be more accurate, but it's also incredibly slow and probably the main bottleneck of the emulation. Unfortunately, GPU access is still being worked on, so it's currently not possible for homebrew apps to use hardware graphics acceleration.
2. JIT compilation. The ARM interpreter is also quite resource-intensive, and JIT compilation could greatly speed things up. What's standing in the way is the CPU architecture: Desmume supports JIT compilation for x86 and 32-bit ARM platforms, but not ARM64. Implementing it would be quite an undertaking, although it's within the bounds of practicality.
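
To make the interpreter vs. JIT difference concrete, here is a toy sketch. Everything in it is made up for illustration (a three-instruction "guest CPU", the `interpret` and `translate` helpers, generating Python source instead of machine code); it is not real ARM and not how Desmume is actually written. The interpreter decodes every instruction on every run, while the translator converts a block once and reuses the cached result.

```python
# Toy illustration only: a made-up 3-instruction "guest CPU".
# Not real ARM and not Desmume code.

# A guest program is a list of (opcode, register, immediate) tuples.
PROGRAM = [("mov", 0, 5), ("add", 0, 3), ("mul", 0, 2)]


def interpret(program, regs):
    """Interpreter: decode and dispatch every instruction on every run."""
    for op, reg, imm in program:
        if op == "mov":
            regs[reg] = imm
        elif op == "add":
            regs[reg] += imm
        elif op == "mul":
            regs[reg] *= imm
        else:
            raise ValueError(f"unknown opcode: {op}")
    return regs


# Cache of already translated blocks, keyed by the guest code itself.
_block_cache = {}

# Host-code template per guest opcode (here the "host code" is Python source).
_TEMPLATES = {
    "mov": "regs[{r}] = {i}",
    "add": "regs[{r}] += {i}",
    "mul": "regs[{r}] *= {i}",
}


def translate(program):
    """Translate a block to a host function once, then reuse it from the cache."""
    key = tuple(program)
    if key not in _block_cache:
        lines = ["def block(regs):"]
        for op, reg, imm in program:
            lines.append("    " + _TEMPLATES[op].format(r=reg, i=imm))
        lines.append("    return regs")
        namespace = {}
        exec("\n".join(lines), namespace)  # the decode cost is paid here, once
        _block_cache[key] = namespace["block"]
    return _block_cache[key]


print(interpret(PROGRAM, [0]))   # decodes all three instructions
print(translate(PROGRAM)([0]))   # runs the pre-translated block
```

A real JIT emits machine code for the host CPU instead of Python source, which is why it needs a separate backend per host architecture, and why the missing ARM64 backend is the blocker here.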
What still perplexes me is that when running Lakka, Desmume ran at full speed with GPU acceleration disabled. It's possible that I was running a newer (or older) version of Desmume that was better optimized. It's also possible that Lakka clocks the CPU at a higher speed (although I think it only clocks the GPU higher). Finally, I could simply have done something wrong while testing, so that the hardware-accelerated rendering wasn't actually disabled.
But that's just me. I would love to be proved wrong by someone who manages to get full-speed emulation without JIT compilation or GPU access!