Homebrew Challenge: Improve PicodriveDS code

xonn

Well-Known Member
OP
Member
Joined
Jan 11, 2020
Messages
148
Trophies
0
Age
34
XP
893
Country
Spain
Hello everyone.
I'm "fighting" against the PicodriveDS code in order to improve its performance.
The main reason is to get a decent Genesis emulator that can be played on the DS bottom screen. I'm still a bit of a noob at coding in ASM, so I'm starting by porting some easy functions to a .S file.
After some changes, I have found that FPS have not increased significantly, so I'm starting to think that the problem could be in one or more of these areas:
  • The Cyclone 68000 emulator originally used was v0.084, and there's a slightly newer version: v0.088. Maybe an update could improve performance? The problem is that I can't get a compiled version that works with PicodriveDS.
  • Port more functions to ASM, especially everything related to pixel processing.
  • Maybe process the frame in another way? Right now it seems that the program processes each horizontal line, converting the color format (Genesis format to DS format) and drawing the result at the end.
  • The program was written using devkitARM v20. Does anyone know if updating the code to a newer version would improve performance?
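For anyone wondering what the color format change involves, here is a minimal C sketch. It assumes the usual layouts (Genesis CRAM words hold 3 bits per channel, DS palette words hold 5 bits per channel in BGR order). The function names are hypothetical and are not taken from the PicodriveDS source.

```c
#include <stdint.h>

/* Genesis CRAM word: 0000 BBB0 GGG0 RRR0 (3 bits per channel).
 * DS palette word:   0BBB BBGG GGGR RRRR (5 bits per channel).
 * Hypothetical helper, not the emulator's actual routine. */
static inline uint16_t md_to_ds_color(uint16_t md)
{
    uint16_t r = (md >> 1) & 7;
    uint16_t g = (md >> 5) & 7;
    uint16_t b = (md >> 9) & 7;
    /* Place each 3-bit value in the top 3 bits of its 5-bit field. */
    return (uint16_t)((r << 2) | (g << 7) | (b << 12));
}

/* Convert a whole palette (the Genesis has 64 entries). */
void md_to_ds_palette(const uint16_t *cram, uint16_t *out, int count)
{
    for (int i = 0; i < count; i++)
        out[i] = md_to_ds_color(cram[i]);
}
```

With this simple mapping the low two bits of every channel stay at zero, so full white comes out as 0x739C rather than 0x7FFF.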
All changes can be checked on the project's GitHub. All help and ideas are welcome :)

Update 1:
Coding in ASM is a nightmare...
Anyway, there are visible improvements:
- Frameskip has been lowered from 4 to 3 and gameplay has not been affected (I think...)
- All new ASM functions are stored in Functions_asm.s

Update 2:
Coding in ASM is still a nightmare :(
- More functions have been coded and the current frameskip is now smoother (maybe in the future we could use this free CPU time to enable sound?)
- The ASM version of DmaFill is broken... Can anybody check it? I have spent hours on it without locating the problem, so it has been renamed to DmaFill_fail and the old C version in VideoPort.c has been uncommented.

Update 3:
- DrawAllSprites and DrawSprite have been re-coded in ASM. DrawAllSprites is called every frame and DrawSprite is called individually for each sprite. However, there's no significant change in FPS...
- Version will be 0.1.8 from now :yay:

Update 4:
- All code has been replaced with an adapted TwilightMenu version of Picodrive. Now it works independently of TWL.
- Some ASM has been reverted. Sadly, my ASM code is sometimes slower than the C code :(
- Does anybody know why line 1053 of main.cpp now triggers a data abort when the ROM is changed? I have commented the line out, but I'm not sure if that's a good idea...

Update 5:
- UpdatePalette used to be called just before drawing every frame, but the palette changes less frequently than that. Now it's invoked only when more than 9 bytes have changed in CRAM. With this change, FPS have increased slightly, but not for free: the color palette shows inconsistencies for an instant when a scene changes.
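The idea behind this update can be sketched roughly as follows. This is only an illustration in plain C: the function name, the shadow buffer and the constants are assumptions of mine, and the real UpdatePalette code may track changes differently.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CRAM_BYTES 128        /* 64 colors x 2 bytes on the Genesis */
#define CHANGE_THRESHOLD 9    /* "more than 9 bytes changed" */

/* Hypothetical gate: compares CRAM with a shadow copy taken at the last
 * refresh. Returns 1 if the palette was refreshed, 0 if it was skipped. */
int maybe_update_palette(const uint8_t *cram, uint8_t *shadow)
{
    assert(cram != NULL && shadow != NULL);

    int changed = 0;
    for (int i = 0; i < CRAM_BYTES; i++)
        changed += (cram[i] != shadow[i]);

    if (changed <= CHANGE_THRESHOLD)
        return 0;                      /* too few changes: skip this frame */

    memcpy(shadow, cram, CRAM_BYTES);  /* remember what was uploaded */
    /* The real emulator would convert and upload the palette here. */
    return 1;
}
```

In this sketch small changes pile up until they cross the threshold, which would fit the short palette inconsistencies seen on scene changes.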

Update 6:
- PicoRead8 function coded in ASM
- OtherRead16 function coded in ASM
- Damn... FPS are stuck at 45-50 (max frameskip=2) on a DS Lite for Sonic 1 :cry:
- Damn again... Flashback game doesn't start after the developer logo

Update 7:
- The memory functions written in ARM ASM are now in a separate .S file called Memory_asm.s
- PicoRead8, PicoRead16 and PicoRead32 functions coded in ASM. The code could be improved, so everybody is invited to check it
- On a DSi console, the extra power helps some games run smoooothly (55-60 FPS) :)
- The UpdatePalette function is now called every 15 vblanks. Fade effects don't look their best, but we save some cycles.

To try it yourself, download the NDS file from the master branch, not the release.
 

xonn

Why not use PicoDriveTWL as a base? It compiles on the latest devkitarm version
I have downloaded the latest version of PicodriveTWL and it doesn't work on an R4 card (original R4 with Wood firmware). It would be great to work with this updated version if it could be enjoyed on both DS and DSi consoles :(
 

NightScript

Well-Known Member
Member
Joined
Feb 7, 2016
Messages
951
Trophies
1
Age
20
XP
2,230
Country
United States
I have downloaded the latest version of PicodriveTWL and it doesn't work on an R4 card (original R4 with Wood firmware). It would be great to work with this updated version if it could be enjoyed on both DS and DSi consoles :(
Launch with TWiLight Menu++ instead.
Also, the limit has been dropped to 2.5 for some reason; @Robz8 can explain why.
 

NightScript

It's because the ARM7 binary has moved to main memory, in order for the emulator to be launched as a CIA.
Do you think you could make separate builds for CIA and homebrew launchers? It would be helpful for DS Lite users, who either way don't have CIA capabilities.
 

xonn

Updated first message with some news...

Do you think you could make separate builds for CIA and homebrew launchers? It would be helpful for DS Lite users, who either way don't have CIA capabilities.
Porting all the new ASM functions between the "old" and "TWL" versions should be easy: just comment out the C functions, declare them, and add the Functions_asm.s file to the /source/pico/ folder.
 

Coto

-
Member
Joined
Jun 4, 2010
Messages
2,979
Trophies
2
XP
2,564
Country
Chile
- Some ASM has been reverted. Sadly, my ASM code is sometimes slower than the C code :(

ARM processors really hate context switches (doing some OS task, then entering a user function). As a general rule in emulators, it is sometimes better to avoid too many jumps (function re-entrancy), even if the code is in TCM, because the code still has to go through the prefetch unit and the caches (if enabled). If a code segment is too large, it won't fit in the caches and has to be fetched from slower memory. So calling short, inlined C code from ITCM can really speed up timing-dependent hardware functionality and/or emulator pieces (the ones that get called very often and at exact intervals).

ARM Assembly:
If you can rewrite the emulated CPU in assembly it's definitely going to be faster. But doing that requires extensive knowledge of both CPUs (the emulated one and the host).

Always use simpler (fewer-cycle) ARM opcodes. Try to avoid LDR/STR(x) opcodes when reading immediate (hardcoded) values; use MOV, ADD, SUB and/or the barrel shifter instead for multiplications, or for literally anything that involves scaling numbers by powers of 2.

Code:
MOV r1, r3, LSR #7        @ r1 = r3 / 128 (unsigned)
ADD r9, r8, r8, LSL #2    @ r9 = r8 + r8*4 = r8 * 5
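The same two identities can be checked on a PC with a small C sketch. The helper names are made up for illustration; LSR is a logical shift, so the division only matches for unsigned values.

```c
#include <assert.h>
#include <stdint.h>

/* MOV r1, r3, LSR #7  ->  r1 = r3 / 128 (unsigned) */
static inline uint32_t div128(uint32_t x) { return x >> 7; }

/* ADD r9, r8, r8, LSL #2  ->  r9 = r8 + r8*4 = r8 * 5 */
static inline uint32_t mul5(uint32_t x) { return x + (x << 2); }

/* Compare the shift forms against plain division and multiplication. */
void check_shift_identities(void)
{
    for (uint32_t x = 0; x < 100000; x++) {
        assert(div128(x) == x / 128);
        assert(mul5(x) == x * 5);
    }
}
```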
 

xonn

Thanks @Coto for your message. It's really instructive.
New changes have been made. Please check the emulator and post your thoughts :)
 

Aikku

Member
Newcomer
Joined
Jun 12, 2010
Messages
11
Trophies
1
Age
30
Location
Aussie
XP
299
Country
Australia
I had a look at pico/Functions_asm.s. Much improvement can be made here, but the biggest things that come to mind are:
1) Refactor expressions to take advantage of the barrel shifter and status register flags. As an extreme example, if you change line 197 to "ADDS r9, r3, #0" (the main point being to clear the carry flag), then lines 200..202 can be changed to a single instruction: "RSC r5, r5, r3, lsl #3".
2) Since this is ARM9 (at least I'm assuming it is; if this is to be executed on ARM7, ignore the advice related to ARM9), avoid using memory-loaded registers right away (eg. lines 25..26). Using the register that was loaded from memory right away incurs a 1c penalty (for 32bit loads; 2c for 8bit and 16bit reads).
3) You don't need BX to return on ARM9, even when changing ARM<->THUMB. So if you pushed the link register to the stack, you can pop it straight back into pc to return (eg. "STMFD sp!, {lr} ... LDMFD sp!, {pc}" or "PUSH {lr} ... POP {pc}" in THUMB).
4) Branches cost at least 3c. So if the opposite condition's code can be made conditional (eg. lines 200-202) and cost 3c or less, do that and avoid the branch. For performance comparison (ignoring cache effects), the code on lines 199-202 takes 3c for the Z=1 path, but 4c for the Z=0 path (average of 3.5c); if it was made conditional, both paths would take 3c.
5) Similar to point #2, avoid using MUL/MLA results right away on ARM9 (eg. lines 209-210) as these incur a 1c penalty (as an aside, those lines can be combined into "MLA r8, r2, r9, r8"). On a related note, if you know the bounds of your registers, you might have better performance using SMULxy/SMLAxy.
6) Combine conditional branches (eg. lines 304-307 can be changed to "CMP r5, #80; CMPNE r4, #21; BEQ .endwhile1das")
7) Use conditionals more freely (eg. lines 408-409 can be changed into a single "TST r8, #65536", but the whole expression on lines 408-413 can be changed to "TST r8, #65536; RSBNE r3, r3, #0").
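For readers less used to conditional execution, points 6 and 7 can be modelled in C as below. This is only a sketch of what the quoted instruction sequences compute; the function names are hypothetical, and a C compiler is free to emit different instructions for them.

```c
#include <stdint.h>

/* TST r8, #65536 ; RSBNE r3, r3, #0
 * Negate r3 only when bit 16 of r8 is set, without a branch. */
int32_t negate_if_bit16(int32_t r3, uint32_t r8)
{
    uint32_t mask = 0u - ((r8 >> 16) & 1u);   /* 0 or 0xFFFFFFFF */
    /* (x ^ all-ones) - all-ones == ~x + 1 == -x; with mask 0 x is unchanged. */
    return (int32_t)(((uint32_t)r3 ^ mask) - mask);
}

/* CMP r5, #80 ; CMPNE r4, #21 ; BEQ target
 * The second compare only runs when the first one failed, so the
 * branch is taken when r5 == 80 OR r4 == 21. */
int branch_taken(uint32_t r5, uint32_t r4)
{
    return (r5 == 80) | (r4 == 21);
}
```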
 

xonn

Thanks for all the info. I don't have much experience with assembly language, so I expected my code to have a lot of weak points.
In the NeoDS emulator, the author prepared a lot of routines for Cyclone in ASM, and the performance is great. It's a pity that the jEnesisDS source code isn't available, so the only way to improve Genesis emulation is through PicodriveDS :(
 

xonn

I have some questions for those of you who know ARM assembler a little better.
  1. Is there a better way to load "difficult" literals? For example, what would be a good substitute for LDR r0, =0xA0400 ?
  2. Every time I call a subfunction with BL, I save registers r1 to r3 on the stack, but they are known as "scratch" registers... Is it necessary to do that?
Thanks for your support :)

Edit: Version updated, more info on first post
 

Aikku

1. Some literals you can build quickly (for example, A0400h could be the two-cycle sequence "MOV r0, #0xA0000; ORR r0, r0, #0x400"), but if it would take more than three cycles to build (or two cycles for ITCM code on ARM9), just use LDR.
2. By convention, you only need to save r4-r11/fp and r14/lr, and even then only the ones you modify (eg. if your function only modifies r4 and lr/r14, you only need to save those registers). Also generally (on ARM9 at least), you'll want to keep your stack aligned to 8 bytes (64 bits) for LDRD instructions (afaik, it will still work on NDS even when not aligned to 8 bytes, but some other ARM chips will throw an exception). Additionally, if you only need to save one register, it's 1c faster to use STR/LDR rather than STMFD/LDMFD afaik.
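To see why 0xA0400 needs two instructions while 0xA0000 and 0x400 each fit in one, here is a small C sketch of the ARM data-processing immediate rule (an 8-bit value rotated right by an even amount). The function name is hypothetical, and it only covers ARM-mode immediates, not THUMB.

```c
#include <stdint.h>

/* Returns 1 if v can be encoded as a single ARM-mode immediate operand,
 * i.e. an 8-bit value rotated right by an even amount (0, 2, ..., 30). */
int arm_imm_encodable(uint32_t v)
{
    for (unsigned rot = 0; rot < 32; rot += 2) {
        /* Rotating left by rot undoes a rotate-right by rot. */
        uint32_t undone = rot ? ((v << rot) | (v >> (32 - rot))) : v;
        if (undone <= 0xFFu)
            return 1;
    }
    return 0;
}
```

For the values in point 1, arm_imm_encodable(0xA0000) and arm_imm_encodable(0x400) both return 1, while arm_imm_encodable(0xA0400) returns 0, so the literal has to be split across MOV and ORR or loaded with LDR.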
 