Here's something that might clear things up and also explain why frameskip 2 is needed right now.
During every frame (16.67 milliseconds), emulation is carried out for that frame, then the image is rendered, then the sound. The image takes very little time to generate, but 23 milliseconds to send over the cart bus, which is longer than one frame. Controls are synchronised after that, through a queue of requests to the DS. (In the picture, the cart bus time is to scale; the CPU time might not be, since some frames finish early and some finish late.)
Exhibit A: Bad synchronisation at frameskip 1
The above picture is at frameskip 2. At frameskip 1, the time it takes to send the image and the sound means that, sometimes, the controls are skipped because another image is ready at that exact moment. (Move the second large green rectangle 1 frame to the left in your head)
If that happens, controls are not updated until the cart bus starts to desynchronise again. Frameskip 2 fixes this.
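To put rough numbers on this, here is a small back-of-the-envelope sketch. It is only an illustration, not code from the emulator. The 16.67 ms frame and the 23 ms image send time come from the description above; the 2 ms for sound plus controls is a guess (picked so that two full passes come to about 50 ms), and the idea that frameskip N means one image is sent every N + 1 frames is also an assumption.

```python
IMAGE_SEND_MS = 23.0      # from the post: time to send one screen over the cart bus
SOUND_CONTROLS_MS = 2.0   # ASSUMPTION: rough guess for sound + controls traffic


def image_period_ms(frameskip):
    """Time between two image sends.

    ASSUMPTION: frameskip N means one image is sent every N + 1 frames
    of a 60 Hz emulated machine.
    """
    return (frameskip + 1) * 1000 / 60


def bus_slack_ms(frameskip):
    """Idle cart bus time per displayed image, under the assumptions above."""
    busy = IMAGE_SEND_MS + SOUND_CONTROLS_MS
    return image_period_ms(frameskip) - busy


for fs in (0, 1, 2):
    print(f"frameskip {fs}: slack {bus_slack_ms(fs):.2f} ms")
```

Under these assumptions the slack is negative at frameskip 0 (the bus cannot keep up at all), about 8 ms at frameskip 1 (less than one frame, so a late frame can plausibly push the next image right onto the controls request), and 25 ms at frameskip 2. The real behaviour depends on when each frame's emulation actually finishes, which this sketch ignores.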
Exhibit B: Copying missing pixels to the bottom screen
To update the bottom screen as well, tack on an additional 23 milliseconds to send it, even if only 1/6 of it is used. That means the cart bus sequence becomes Image, Sound, Controls, Image, Sound, Controls, for 50 milliseconds. By that time, it's possible that the controls get skipped because an image is ready at that exact moment. So you'd need frameskip 3 at that point, and the image would update in steps: top first, bottom next, top next, bottom next, not all at once. That would get weird rather fast, too.
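The same arithmetic can be sketched for the two-screen case. Again, this is only an illustration and not emulator code: the 23 ms per screen is from the description above, while the 2 ms for sound plus controls and the "one image every frameskip + 1 frames" reading of frameskip are assumptions, and the function names are made up for this example.

```python
IMAGE_SEND_MS = 23.0      # from the post: time to send one full screen
SOUND_CONTROLS_MS = 2.0   # ASSUMPTION: rough guess for sound + controls traffic


def bus_schedule(passes):
    """Order of cart bus transfers when top and bottom screens alternate."""
    screens = ("top", "bottom")
    schedule = []
    for i in range(passes):
        schedule += [f"image ({screens[i % 2]})", "sound", "controls"]
    return schedule


def busy_ms(passes):
    """Total cart bus time for the given number of image/sound/controls passes."""
    return passes * (IMAGE_SEND_MS + SOUND_CONTROLS_MS)


def slack_ms(frameskip, passes):
    """Idle bus time, ASSUMING one group of passes every frameskip + 1 frames."""
    return (frameskip + 1) * 1000 / 60 - busy_ms(passes)


print(bus_schedule(2))
print(f"bus busy for {busy_ms(2):.0f} ms")
for fs in (2, 3):
    print(f"frameskip {fs}: slack {slack_ms(fs, 2):.2f} ms")
```

With these guesses, one top-plus-bottom update keeps the bus busy for about 50 ms, which leaves no slack at all at frameskip 2 and roughly one frame of slack at frameskip 3. The schedule also shows why the two screens would visibly update one after the other rather than together.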
When using its hardware acceleration, as SNEmulDS and nesDS do, the Nintendo DS can start drawing the bottom screen from the same graphics within the same frame. That is much better, but unfortunately that method is unavailable when coding for the DSTwo's MIPS processor.