Homebrew Challenge: Improve PicodriveDS code

xonn

Well-Known Member
OP
Member
Joined
Jan 11, 2020
Messages
148
Trophies
0
Age
34
XP
893
Country
Spain
Hello everyone.
I'm "fighting" against the PicodriveDS code in order to improve its performance.
The main goal is to get a decent Genesis emulator that can be played on the DS bottom screen. I'm a bit of a noob at coding in ASM, so I'm trying to port some easy functions to a .S file.
After some changes, I have found that FPS has not increased significantly, so I'm starting to think that the problem could lie in one or more of these areas:
  • The Cyclone 68000 emulator originally used was v0.084, and there's a slightly newer version: v0.088. Maybe an update could improve performance? The problem is that I can't get a compiled version that works with PicodriveDS.
  • Port more functions to ASM, especially everything related to pixel processing.
  • Maybe process the frame in another way? Right now, it seems that the program processes each horizontal line, converting the color data from Genesis format to DS format, and prints the result at the end.
  • The program was written using devkitARM v20. Does anyone know if updating the code to a newer version would improve performance?
All changes can be checked in the following GitHub repository. All help and ideas are welcome :)
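
For readers unfamiliar with the color conversion mentioned in the list above, here is a minimal C sketch of it. It assumes the usual Genesis CRAM layout (3 bits per channel, stored as 0000 BBB0 GGG0 RRR0) and the DS 15-bit layout (5 bits per channel, red in the low bits). The function names are hypothetical and are not taken from the PicodriveDS source.

```c
#include <stdint.h>

/* Expand a 3-bit channel (0..7) to 5 bits (0..31) by repeating its top bits,
   so that 0 maps to 0 and 7 maps to 31. */
static inline uint16_t expand3to5(uint16_t v)
{
    return (uint16_t)((v << 2) | (v >> 1));
}

/* Hypothetical helper: convert one Genesis CRAM word to a DS color.
   Genesis CRAM word: 0000 BBB0 GGG0 RRR0
   DS color:          0BBB BBGG GGGR RRRR */
uint16_t genesis_to_ds_color(uint16_t cram)
{
    uint16_t r = expand3to5((cram >> 1) & 7);
    uint16_t g = expand3to5((cram >> 5) & 7);
    uint16_t b = expand3to5((cram >> 9) & 7);
    return (uint16_t)(r | (g << 5) | (b << 10));
}
```

Since CRAM holds only 64 entries, converting the palette once when it changes should be much cheaper than converting every pixel of every line, which is one possible reading of the "process the frame in another way" idea.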

Update 1:
Coding in ASM is a nightmare...
Anyway, there are visible improvements:
- Frameskip has been lowered from 4 to 3 and gameplay has not been affected (I think...)
- All new ASM functions are stored in Functions_asm.s

Update 2:
Coding in ASM is still a nightmare :(
- More functions have been coded, and gameplay at the current frameskip is now smoother (maybe in the future we could use the freed CPU time to enable sound?)
- The DmaFill function is broken... Can anybody check it? I have spent hours without locating the problem, so it has been renamed to DmaFill_fail and the old C version in VideoPort.c has been uncommented.

Update 3:
- DrawAllSprites and DrawSprite have been re-coded in ASM. DrawAllSprites is called every frame and DrawSprite is called once for each sprite. However, there is no significant change in FPS...
- Version will be 0.1.8 from now on :yay:

Update 4:
- All code has been replaced with an adapted TwilightMenu version of Picodrive. Now it works independently of TWL.
- Some ASM has been undone. Sadly, my ASM code is sometimes slower than the C code :(
- Does anybody know why line 1053 of main.cpp now causes a data abort when the ROM is changed? I have commented the line out, but I'm not sure if it's a good idea...

Update 5:
- Until now, UpdatePalette was called just before drawing every frame, but the palette changes less frequently than that. Now it's invoked only when more than 9 bytes have changed in CRAM. This has increased FPS slightly, but at a cost: the color palette is briefly inconsistent when a scene changes.
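
The "more than 9 changed bytes" rule in Update 5 can be sketched in C as below. This is only an illustration under assumptions: the names `cram_write8` and `maybe_update_palette` are hypothetical, the counter counts value-changing writes (so the same byte changed twice counts twice), and the real UpdatePalette work is replaced by a comment.

```c
#include <stdint.h>
#include <stdbool.h>

#define CRAM_BYTES      128  /* Genesis CRAM: 64 colors x 2 bytes */
#define DIRTY_THRESHOLD 9    /* from the post: refresh only above 9 changed bytes */

static uint8_t  cram[CRAM_BYTES];
static unsigned cram_dirty_bytes;  /* value-changing writes since the last refresh */

/* Hypothetical CRAM write hook: count only writes that change a byte. */
void cram_write8(unsigned addr, uint8_t value)
{
    addr %= CRAM_BYTES;
    if (cram[addr] != value) {
        cram[addr] = value;
        cram_dirty_bytes++;
    }
}

/* Call once per frame before drawing. Returns true when the palette
   was refreshed (the real code would call UpdatePalette here). */
bool maybe_update_palette(void)
{
    if (cram_dirty_bytes <= DIRTY_THRESHOLD)
        return false;
    /* ... convert CRAM to the DS palette here ... */
    cram_dirty_bytes = 0;
    return true;
}
```

The sketch also makes the trade-off from the post visible: a small change of 9 bytes or fewer never reaches the screen until further writes push the counter over the threshold, which would explain the brief palette glitches on scene changes.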

Update 6:
- PicoRead8 function coded in ASM
- OtherRead16 function coded in ASM
- Damn... FPS is stuck at 45-50 (max frameskip=2) on an NDS Lite for Sonic 1 :cry:
- Damn again... Flashback game doesn't start after the developer logo

Update 7:
- The ARM memory functions are now in a separate .S file called Memory_asm.s
- PicoRead8, PicoRead16 and PicoRead32 functions coded in ASM. The code could be improved, so everybody is invited to check it
- On a DSi console, the extra power helps some games run smoothly (55-60fps) :)
- The UpdatePalette function is now called every 15 vblanks. Fade effects are not the best, but we save some cycles.

To try it yourself, download the NDS file from the master branch, not the release.
 

xonn

Why not use PicoDriveTWL as a base? It compiles on the latest devkitarm version
I have downloaded the latest version of PicodriveTWL and it doesn't work on an R4 card (original R4 with Wood firmware). It would be great to work with this updated version if it could be enjoyed on both DS and DSi consoles :(
 

NightScript

Well-Known Member
Member
Joined
Feb 7, 2016
Messages
951
Trophies
1
Age
20
XP
2,227
Country
United States
I have downloaded the latest version of PicodriveTWL and it doesn't work on an R4 card (original R4 with Wood firmware). It would be great to work with this updated version if it could be enjoyed on both DS and DSi consoles :(
Launch with TWiLight Menu++ instead.
Also, the limit has been dropped to 2.5 for some reason; @Robz8 can explain why.
 

NightScript

It's because the ARM7 binary has moved to main memory, in order for the emulator to be launched as a CIA.
Do you think you could make separate builds for CIA and homebrew launchers? It would be helpful for DS Lite users, who either way don't have CIA capabilities.
 

xonn

Updated first message with some news...

Do you think you could make separate builds for CIA and homebrew launchers? It would be helpful for DS Lite users, who either way don't have CIA capabilities.
Porting all the new ASM functions between the "old" and "TWL" versions should be easy: just comment out the C functions, declare them, and add the Functions_asm.s file to the /source/pico/ folder.
 

Coto

-
Member
Joined
Jun 4, 2010
Messages
2,979
Trophies
2
XP
2,564
Country
Chile
- Some ASM has been undone. Sadly, my ASM code is sometimes slower than the C code :(

ARM processors really dislike context switches (doing some OS task, then entering a user function). As a general rule in emulators, it is sometimes better to avoid too many jumps (function re-entrancy), even if the code is in TCM memory, because the code still has to be handled by the prefetch unit and the caches (if enabled). If a code segment is too large, it won't fit in the caches and will be fetched from slower memory. So calling inlined (short) C code from ITCM can really speed up timing-dependent hardware functionality and/or emulator pieces (those that get called very often and at exact intervals).

ARM Assembly:
If you can rewrite the emulated CPU in assembly, it's definitely going to be faster. But doing that requires extensive knowledge of the CPUs of both systems (the one emulated and the host).

Always use simpler (fewer-cycle) ARM opcodes. Try to avoid LDR/STR(x) opcodes when reading immediate (hardcoded) values; instead use MOV, ADD, SUB and/or the barrel shifter for multiplications, or for literally anything that involves scaling numbers by powers of 2.

Code:
MOV r1, r3, LSR #7      @ r1 = r3 / 128
ADD r9, r8, r8, LSL #2  @ r9 = r8 * 5
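
As a rough cross-check of the two instructions above, the same arithmetic can be written in C. The function names are made up for illustration, and `mul7` is an extra example (one RSB with a shifted operand) that does not appear in the post. Note that LSR is a logical shift, so the division by 128 only holds for unsigned values.

```c
#include <stdint.h>

/* MOV r1, r3, LSR #7 : logical shift right = unsigned divide by 128 */
uint32_t div128(uint32_t r3)
{
    return r3 >> 7;
}

/* ADD r9, r8, r8, LSL #2 : r8 + (r8 * 4) = r8 * 5 */
uint32_t mul5(uint32_t r8)
{
    return r8 + (r8 << 2);
}

/* Extra example, not from the post:
   RSB r0, r0, r0, LSL #3 : (r0 * 8) - r0 = r0 * 7 */
uint32_t mul7(uint32_t r0)
{
    return (r0 << 3) - r0;
}
```

A modern compiler will often emit these shift-and-add forms on its own for constant multipliers, so the main benefit is in hand-written assembly.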
 

xonn

Thanks @Coto for your message. It's really instructive.
New changes have been made. Please check the emulator and post your thoughts :)
 

Aikku

Member
Newcomer
Joined
Jun 12, 2010
Messages
11
Trophies
1
Age
30
Location
Aussie
XP
299
Country
Australia
I had a look at pico/Functions_asm.s. Much improvement can be made here, but the biggest things that come to mind are:
1) Refactor expressions to take advantage of the barrel shifter and status register flags. As an extreme example, if you change line 197 to "ADDS r9, r3, #0" (the main point being to clear the carry flag), then lines 200..202 can be changed to a single instruction: "RSC r5, r5, r3, lsl #3".
2) Since this is ARM9 (at least I'm assuming it is; if this is to be executed on ARM7, ignore the advice related to ARM9), avoid using memory-loaded registers right away (eg. lines 25..26). Using the register that was loaded from memory right away incurs a 1c penalty (for 32bit loads; 2c for 8bit and 16bit reads).
3) You don't need BX to return on ARM9, even when changing ARM<->THUMB. So if you pushed the link register to the stack, you can pop it straight back into pc to return (eg. "STMFD sp!, {lr} ... LDMFD sp!, {pc}" or "PUSH {lr} ... POP {pc}" in THUMB).
4) Branches cost at least 3c. So if the opposite condition's code can be made conditional (eg. lines 200-202) and cost 3c or less, do that and avoid the branch. For performance comparison (ignoring cache effects), the code on lines 199-202 takes 3c for the Z=1 path, but 4c for the Z=0 path (average of 3.5c); if it was made conditional, both paths would take 3c.
5) Similar to point #2, avoid using MUL/MLA results right away on ARM9 (eg. lines 209-210) as these incur a 1c penalty (as an aside, those lines can be combined into "MLA r8, r2, r9, r8"). On a related note, if you know the bounds of your registers, you might have better performance using SMULxy/SMLAxy.
6) Combine conditional branches (eg. lines 304-307 can be changed to "CMP r5, #80; CMPNE r4, #21; BEQ .endwhile1das")
7) Use conditionals more freely (eg. lines 408-409 can be changed into a single "TST r8, #65536", but the whole expression on lines 408-413 can be changed to "TST r8, #65536; RSBNE r3, r3, #0").
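
To make points 6 and 7 easier to follow, here is what those instruction sequences compute, written as a small C sketch. The function names are hypothetical, and the register roles are taken only from the quoted instructions, not from Functions_asm.s itself.

```c
#include <stdint.h>

/* Point 7: "TST r8, #65536; RSBNE r3, r3, #0"
   TST sets Z from (r8 & 0x10000); RSBNE then computes 0 - r3
   only when that bit is set. No branch is needed. */
uint32_t negate_if_bit16(uint32_t r3, uint32_t r8)
{
    return (r8 & 0x10000u) ? 0u - r3 : r3;
}

/* Point 6: "CMP r5, #80; CMPNE r4, #21; BEQ .endwhile1das"
   The second compare runs only if the first one was "not equal",
   so the branch is taken when r5 == 80 OR r4 == 21. */
int loop_should_end(uint32_t r5, uint32_t r4)
{
    return r5 == 80 || r4 == 21;
}
```

In both cases the conditional instruction replaces a short forward branch, which is the saving described in point 4.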
 

xonn

Thanks for all the info. I don't have much experience with assembly language, so I assumed that my code has a lot of weak points.
In the NeoDS emulator, the author prepared a lot of ASM routines for Cyclone, and the performance is great. It's a pity that the jEnesisDS source code isn't available, so the only way to improve Genesis emulation is through PicodriveDS :(
 

xonn

I have some questions for those of you who know ARM assembly a little better.
  1. Is there a better way to load "difficult" literals? For example, what would be a good substitute for LDR r0, =0xA0400 ?
  2. Every time I call a subfunction with BL, I save registers r1 to r3 on the stack, but they are known as "scratch" registers... Is it necessary to do that?
Thanks for your support :)

Edit: Version updated, more info in the first post
 

Aikku

1. Some literals you can build quickly (for example, A0400h could be the two-cycle sequence "MOV r0, #0xA0000; ORR r0, r0, #0x400"), but if it takes more than three cycles to do it (or two cycles for ITCM code on ARM9), just use LDR.
2. By convention, you only need to save r4-r11/fp and r14/lr, and even then only the ones you modify (e.g. if your function only modifies r4 and lr/r14, you only need to save those registers). Also generally (on ARM9 at least), you'll want to keep your stack aligned to 8 bytes (64 bits) for LDRD instructions (afaik, it will still work on NDS even when not aligned to 8 bytes, but some other ARM chips will throw an exception). Additionally, if you only need to save one register, it's 1c faster to use STR/LDR rather than STMFD/LDMFD afaik.
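
As background for answer 1: a constant fits a single ARM data-processing instruction such as MOV or ORR only if it is an 8-bit value rotated right by an even amount. The following C sketch checks that rule. The function name `arm_imm_encodable` is hypothetical and the code is an illustration, not something taken from an assembler.

```c
#include <stdint.h>
#include <stdbool.h>

/* True if v can be encoded as an ARM (not Thumb) data-processing
   immediate: an 8-bit value rotated right by 0, 2, 4, ..., 30 bits. */
bool arm_imm_encodable(uint32_t v)
{
    for (unsigned rot = 0; rot < 32; rot += 2) {
        /* Rotating left by rot undoes a rotate-right by rot.
           rot == 0 is handled separately to avoid a shift by 32. */
        uint32_t undone = rot ? (v << rot) | (v >> (32 - rot)) : v;
        if (undone <= 0xFFu)
            return true;
    }
    return false;
}
```

With this check, 0xA0000 and 0x400 each pass while 0xA0400 does not, which is why the literal from the question needs either the MOV/ORR pair shown above or an LDR from a literal pool.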
 