1. xonn

    OP xonn GBAtemp Regular
    Member

    Joined:
    Jan 11, 2020
    Messages:
    139
    Country:
    Spain
    Hello everyone.
    I'm "fighting" against the PicodriveDS code in order to improve its performance.
    The main goal is to get a decent Genesis emulator that can be played on the DS bottom screen. I'm a bit of a noob at coding in ASM, so I'm trying to port some easy functions to a .S file.
    After some changes, I have found that FPS has not increased significantly, so I'm starting to think that the problem could be in one or more of these areas:
    • The Cyclone 68000 emulator originally used was v0.084, and there's a slightly newer version: v0.088. Maybe an update could improve performance? The problem is that I can't get a compiled version that works with PicodriveDS
    • Port more functions to ASM, especially everything related to pixel processing
    • Maybe process the frame in another way? Right now, it seems that the program processes each horizontal line, converting the color byte format (Genesis format to DS format) and printing the result at the end.
    • The program was written using Devkitarm v20. Does anyone know if updating the code to a newer version would improve performance?
    All changes can be checked on GitHub. All help and ideas are welcome :)
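
    To make the "Genesis format to DS format" step concrete, here is a minimal C sketch of a per-colour conversion. It assumes the usual layouts (Genesis CRAM word 0000BBB0GGG0RRR0, DS palette entry in BGR555); the helper names expand3to5 and genesis_to_ds are hypothetical and not taken from PicodriveDS.

```c
#include <assert.h>
#include <stdint.h>

/* Genesis CRAM word: 0000 BBB0 GGG0 RRR0 (3 bits per channel).
   DS palette entry:  0BBB BBGG GGGR RRRR (5 bits per channel, BGR555). */

/* Expand a 3-bit channel to 5 bits by replicating its top bits,
   so 0 maps to 0 and 7 maps to 31. */
static inline uint16_t expand3to5(uint16_t c3)
{
    assert(c3 < 8);
    return (uint16_t)((c3 << 2) | (c3 >> 1));
}

/* Convert one Genesis colour word to a DS colour (hypothetical helper). */
static inline uint16_t genesis_to_ds(uint16_t md)
{
    uint16_t r = expand3to5((md >> 1) & 7);
    uint16_t g = expand3to5((md >> 5) & 7);
    uint16_t b = expand3to5((md >> 9) & 7);
    return (uint16_t)(r | (g << 5) | (b << 10));
}
```

    If the conversion is applied to the 64 palette entries rather than to every pixel of every line, the per-pixel work reduces to an index lookup. Whether that fits the existing renderer is something only profiling the real code can tell.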

    Update 1:
    Coding in ASM is a nightmare...
    Anyway, there are visible improvements:
    - Frameskip has been lowered from 4 to 3 and gameplay has not been affected (I think...)
    - All new ASM functions are stored in Functions_asm.s

    Update 2:
    Coding in ASM is still a nightmare :(
    - More functions coded, and the current frameskip is now smoother (maybe in the future we could use this free CPU time to activate sound?)
    - Function DmaFill is broken... could anybody check it? I have spent hours without locating the problem, so it has been renamed to DmaFill_fail and the old one in VideoPort.c has been uncommented

    Update 3:
    - DrawAllSprites and DrawSprite have been re-coded in ASM. DrawAllSprites is called every frame and DrawSprite is called individually for each sprite. However, there's no significant change in FPS...
    - Version will be 0.1.8 from now :yay:

    Update 4:
    - All code has been replaced with an adapted TwilightMenu version of Picodrive. Now it works independently of TWL.
    - Some ASM changes have been undone. Sadly, my ASM code is sometimes slower than the C code :(
    - Does anybody know why line 1053 of main.cpp now causes a data abort when the ROM is changed? I have commented the line out, but I'm not sure if it's a good idea...

    Update 5:
    - The UpdatePalette function was being called just before drawing every frame, but the palette changes less frequently. Now it's invoked only when more than 9 bytes have changed in CRAM. With this change, FPS has increased slightly, but at a cost: the color palette briefly shows inconsistencies when a scene changes.
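
    The threshold idea can be sketched in C roughly as follows. This is only an illustration under assumptions: the PaletteState type, cram_write and maybe_update_palette are hypothetical names, and the way changed bytes are counted here may differ from what the emulator actually does.

```c
#include <assert.h>
#include <stdint.h>

#define CRAM_WORDS      64   /* Genesis CRAM holds 64 colour words */
#define DIRTY_THRESHOLD 9    /* bytes; the value quoted in the post */

typedef struct {
    uint16_t cram[CRAM_WORDS];
    unsigned dirty_bytes;    /* bytes changed since the last refresh */
    unsigned refreshes;      /* how many times the palette was rebuilt */
} PaletteState;              /* hypothetical type */

/* Record a CRAM write, counting only the bytes that really changed. */
static void cram_write(PaletteState *p, unsigned index, uint16_t value)
{
    assert(index < CRAM_WORDS);
    uint16_t diff = (uint16_t)(p->cram[index] ^ value);
    if (diff & 0x00FF) p->dirty_bytes++;
    if (diff & 0xFF00) p->dirty_bytes++;
    p->cram[index] = value;
}

/* Called once per frame; returns 1 if the palette was rebuilt. */
static int maybe_update_palette(PaletteState *p)
{
    if (p->dirty_bytes <= DIRTY_THRESHOLD)
        return 0;            /* too few changes: skip the costly rebuild */
    /* ... the real palette conversion would run here ... */
    p->dirty_bytes = 0;
    p->refreshes++;
    return 1;
}
```

    With a scheme like this, small palette changes stay invisible until enough of them accumulate, which would be consistent with the brief colour inconsistencies described above.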

    Update 6:
    - PicoRead8 function coded in ASM
    - OtherRead16 function coded in ASM
    - Damn... FPS is stuck at 45-50 (max frameskip=2) on a DS Lite for Sonic1 :cry:
    - Damn again... Flashback game doesn't start after the developer logo

    Update 7:
    - Memory ARM functions now are in a separate .S file called: Memory_asm.s
    - PicoRead8, PicoRead16 and PicoRead32 functions coded in ASM. The code could be improved, so everybody is invited to check it
    - On a DSi console, the extra power helps some games run smoooothly (55-60fps) :)
    - The UpdatePalette function is now called every 15 vblanks. Fade effects don't look their best, but we save some cycles.

    To try it yourself, download the NDS file from the master branch, not the release.
     
    Last edited by xonn, Sep 13, 2020
    banjo2, Chainhunter, KiiWii and 5 others like this.
  2. xonn

    OP xonn GBAtemp Regular
    Member

    Joined:
    Jan 11, 2020
    Messages:
    139
    Country:
    Spain
    Updated first message with some news...
     
  3. maorninja

    maorninja GBAtemp Advanced Fan
    Member

    Joined:
    Feb 7, 2016
    Messages:
    892
    Country:
    United States
    Why not use PicoDriveTWL as a base? It compiles on the latest devkitarm version
     
    Robz8 likes this.
  4. xonn

    OP xonn GBAtemp Regular
    Member

    Joined:
    Jan 11, 2020
    Messages:
    139
    Country:
    Spain
    I have downloaded the latest version of PicodriveTWL and it doesn't work on an R4 card (original R4 with Wood firmware). It would be great to work with this updated version if it could be enjoyed on both DS and DSi consoles :(
     
    Shadow#1 likes this.
  5. maorninja

    maorninja GBAtemp Advanced Fan
    Member

    Joined:
    Feb 7, 2016
    Messages:
    892
    Country:
    United States
    Launch with TWiLight Menu++ instead.
    Also, the limit has been dropped to 2.5 for some reason; @Robz8 can explain why.
     
    Robz8 likes this.
  6. Robz8

    Robz8 Coolest of TWL
    Developer

    Joined:
    Oct 1, 2010
    Messages:
    14,575
    Country:
    United States
    It's because the ARM7 binary has moved to main memory, in order for the emulator to be launched as a CIA.
     
    alexander1970 likes this.
  7. maorninja

    maorninja GBAtemp Advanced Fan
    Member

    Joined:
    Feb 7, 2016
    Messages:
    892
    Country:
    United States
    Do you think you could make separate builds for CIA and homebrew launchers? It would be helpful for DS Lite users, who don't have CIA capabilities either way.
     
  8. xonn

    OP xonn GBAtemp Regular
    Member

    Joined:
    Jan 11, 2020
    Messages:
    139
    Country:
    Spain
    Updated first message with some news...

    Porting all the new ASM functions between the "old" and "TWL" versions should be easy: just comment out the C functions, declare them, and add the Functions_asm.s file inside the /source/pico/ folder
     
    Robz8, wariobar and Zense like this.
  9. xonn

    OP xonn GBAtemp Regular
    Member

    Joined:
    Jan 11, 2020
    Messages:
    139
    Country:
    Spain
    Updated first message with some news...
     
  10. xonn

    OP xonn GBAtemp Regular
    Member

    Joined:
    Jan 11, 2020
    Messages:
    139
    Country:
    Spain
    Updated first message with some news...
     
    wariobar and KiiWii like this.
  11. Coto

    Coto -
    Member

    Joined:
    Jun 4, 2010
    Messages:
    2,768
    Country:
    Chile
    ARM processors hate context switches (doing some OS task, then entering a user function). As a general rule in emulators, it is sometimes better to avoid too many jumps (function re-entrancy), even if the code is in TCM memory, because code still has to be handled by the prefetch unit and the caches (if enabled). If a code segment is too large, it won't fit in the caches and will be fetched from slower memory. So calling inlined (short) C code from ITCM can really speed up timing-dependent hardware functionality and/or emulator pieces (those that get called very often and at exact intervals).

    ARM Assembly:
    If you can rewrite the emulated CPU in assembly, it's definitely going to be faster. But doing that requires extensive knowledge of both CPUs (the emulated one and the host).

    Always use simpler (fewer-cycle) ARM opcodes. Try to avoid LDR/STR(x) opcodes when reading immediate (hardcoded) values; use MOV, ADD, SUB and/or the barrel shifter instead for multiplications, or for anything that involves scaling numbers by powers of 2.

    Code:
    MOV r1, r3, LSR #7 equals r1 = r3/128
    ADD r9,r8,r8,LSL #2 equals r9=r8*5
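
    For readers more comfortable in C, the same identities can be written and checked on a PC. This is a small sketch: the function names are made up for illustration, the identities assume unsigned 32-bit values, and the multiply-by-15 line is an extra example that is not from the post above.

```c
#include <assert.h>
#include <stdint.h>

/* MOV r1, r3, LSR #7      ->  r1 = r3 / 128 (unsigned) */
static inline uint32_t div128(uint32_t r3) { return r3 >> 7; }

/* ADD r9, r8, r8, LSL #2  ->  r9 = r8 * 5 */
static inline uint32_t mul5(uint32_t r8) { return r8 + (r8 << 2); }

/* RSB r0, r1, r1, LSL #4  ->  r0 = r1 * 15 */
static inline uint32_t mul15(uint32_t r1) { return (r1 << 4) - r1; }

/* Compare the shift forms against plain division and multiplication. */
static void check_shift_identities(void)
{
    for (uint32_t x = 0; x < 100000u; x++) {
        assert(div128(x) == x / 128u);
        assert(mul5(x)   == x * 5u);
        assert(mul15(x)  == x * 15u);
    }
}
```

    Note that LSR only matches division for unsigned values; for negative signed numbers an arithmetic shift rounds differently from C's division.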
    
     
    Last edited by Coto, Sep 9, 2020
    wariobar, Robz8 and xonn like this.
  12. xonn

    OP xonn GBAtemp Regular
    Member

    Joined:
    Jan 11, 2020
    Messages:
    139
    Country:
    Spain
    Thanks @Coto for your message. It's really instructive.
    New changes made. Please check emulator and post your thoughts :)
     
    Coto likes this.
  13. Aikku

    Aikku Member
    Newcomer

    Joined:
    Jun 12, 2010
    Messages:
    11
    Country:
    Australia
    I had a look at pico/Functions_asm.s. Much improvement can be made here, but the biggest things that come to mind are:
    1) Refactor expressions to take advantage of the barrel shifter and status register flags. As an extreme example, if you change line 197 to "ADDS r9, r3, #0" (the main point being to clear the carry flag), then lines 200..202 can be changed to a single instruction: "RSC r5, r5, r3, lsl #3".
    2) Since this is ARM9 (at least I'm assuming it is; if this is to be executed on ARM7, ignore the advice related to ARM9), avoid using memory-loaded registers right away (eg. lines 25..26). Using the register that was loaded from memory right away incurs a 1c penalty (for 32bit loads; 2c for 8bit and 16bit reads).
    3) You don't need BX to return on ARM9, even when changing ARM<->THUMB. So if you pushed the link register to the stack, you can pop it straight back into pc to return (eg. "STMFD sp!, {lr} ... LDMFD sp!, {pc}" or "PUSH {lr} ... POP {pc}" in THUMB).
    4) Branches cost at least 3c. So if the opposite condition's code can be made conditional (eg. lines 200-202) and cost 3c or less, do that and avoid the branch. For performance comparison (ignoring cache effects), the code on lines 199-202 takes 3c for the Z=1 path, but 4c for the Z=0 path (average of 3.5c); if it was made conditional, both paths would take 3c.
    5) Similar to point #2, avoid using MUL/MLA results right away on ARM9 (eg. lines 209-210) as these incur a 1c penalty (as an aside, those lines can be combined into "MLA r8, r2, r9, r8"). On a related note, if you know the bounds of your registers, you might have better performance using SMULxy/SMLAxy.
    6) Combine conditional branches (eg. lines 304-307 can be changed to "CMP r5, #80; CMPNE r4, #21; BEQ .endwhile1das")
    7) Use conditionals more freely (eg. lines 408-409 can be changed into a single "TST r8, #65536", but the whole expression on lines 408-413 can be changed to "TST r8, #65536; RSBNE r3, r3, #0").
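
    To show what points 6 and 7 compute, here is a rough C model of the two instruction sequences quoted above. The function names are hypothetical and the register roles are only inferred from the quoted instructions, not from the actual Functions_asm.s.

```c
#include <assert.h>
#include <stdint.h>

/* "TST r8, #65536; RSBNE r3, r3, #0":
   negate r3 when bit 16 of r8 is set, without a branch. */
static inline int32_t negate_if_bit16(int32_t r3, uint32_t r8)
{
    assert(r3 != INT32_MIN);                    /* -INT32_MIN overflows in C */
    int32_t mask = -(int32_t)((r8 >> 16) & 1);  /* 0, or -1 (all ones) */
    return (r3 ^ mask) - mask;                  /* r3, or -r3 */
}

/* "CMP r5, #80; CMPNE r4, #21; BEQ label":
   the second compare runs only if the first one failed,
   so the branch is taken when either comparison matches. */
static inline int branch_taken(uint32_t r5, uint32_t r4)
{
    return (r5 == 80) | (r4 == 21);
}
```

    Compilers targeting ARM will often emit conditional instructions for code like this on their own, so it can be worth checking the generated assembly before hand-writing a replacement.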
     
    Last edited by Aikku, Sep 12, 2020
    wariobar likes this.
  14. xonn

    OP xonn GBAtemp Regular
    Member

    Joined:
    Jan 11, 2020
    Messages:
    139
    Country:
    Spain
    Thanks for all the info. I don't have much experience with assembly language, so I suppose my code has a lot of weak points.
    In the NeoDS emulator, the author prepared a lot of routines for Cyclone in ASM, and the performance is great. It's a pity that the jEnesisDS source code isn't available, so the only way to improve Genesis emulation is through PicodriveDS :(
     
    Last edited by xonn, Sep 12, 2020
  15. Shadow#1

    Shadow#1 Wii, 3DS Softmod & Dumpster Diving Expert
    Member

    Joined:
    Nov 21, 2005
    Messages:
    10,248
    Country:
    United States
  16. xonn

    OP xonn GBAtemp Regular
    Member

    Joined:
    Jan 11, 2020
    Messages:
    139
    Country:
    Spain
    For now, there's no "stable" version, so I'm uploading only the NDS file to GitHub. Sorry.
     
    Last edited by xonn, Sep 12, 2020
  17. xonn

    OP xonn GBAtemp Regular
    Member

    Joined:
    Jan 11, 2020
    Messages:
    139
    Country:
    Spain
    I have some questions for those of you who know ARM assembler a little better.
    1. Is there a better way to load "difficult" literals? For example, what would be a good substitute for LDR r0, =0xA0400 ?
    2. Every time I call a subfunction with BL, I save registers r1 to r3 on the stack, but they are known as "scratch" registers... Is it necessary to do so?
    Thanks for your support :)

    Edit: Version updated, more info on first post
     
    Last edited by xonn, Sep 13, 2020
  18. Aikku

    Aikku Member
    Newcomer

    Joined:
    Jun 12, 2010
    Messages:
    11
    Country:
    Australia
    1. Some literals you can build quickly (for example, A0400h could be the two-cycle sequence "MOV r0, #0xA0000; ORR r0, r0, #0x400"), but if you're taking more than three cycles to do it (or two cycles for ITCM code on ARM9), you just use LDR.
    2. By convention, you only need to save r4-r11/fp and r14/lr, and even then only the ones you modify (eg. if your function only modifies r4 and lr/r14, you only need to save those registers). Also generally (on ARM9 at least), you'll want to keep your stack aligned to 8 bytes (64 bits) for LDRD instructions (afaik, it will still work on NDS even when not aligned to 8 bytes, but some other ARM chips will throw an exception). Additionally, if you only need to save one register, it's 1c faster to use STR/LDR rather than STMFD/LDMFD afaik.
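
    For point 1, whether a literal can be built "quickly" comes down to the ARM immediate encoding: a data-processing instruction can hold an 8-bit value rotated right by an even amount. The following host-side C sketch (helper names are hypothetical) checks whether a constant fits one such immediate, or can be built from two of them as in the MOV + ORR example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Rotate left by n bits (n in 0..31). */
static inline uint32_t rol32(uint32_t v, unsigned n)
{
    assert(n < 32);
    return n ? (v << n) | (v >> (32 - n)) : v;
}

/* True if v fits a single ARM data-processing immediate:
   an 8-bit value rotated right by an even amount. */
static bool arm_imm_encodable(uint32_t v)
{
    for (unsigned rot = 0; rot < 32; rot += 2)
        if (rol32(v, rot) <= 0xFFu)
            return true;
    return false;
}

/* True if v can be built as "MOV rX, #a ; ORR rX, rX, #b":
   try every 8-bit window and check that the leftover bits
   also fit a single immediate. */
static bool arm_two_insn_literal(uint32_t v)
{
    for (unsigned rot = 0; rot < 32; rot += 2) {
        uint32_t window = rol32(0xFFu, rot);
        if (arm_imm_encodable(v & ~window))
            return true;
    }
    return false;
}
```

    For 0xA0400, arm_imm_encodable is false (the set bits span 10 positions) while arm_two_insn_literal is true, matching the two-instruction sequence given above. A constant that fails both checks is a reasonable candidate for a plain LDR from a literal pool.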
     
    Last edited by Aikku, Sep 13, 2020
  19. xonn

    OP xonn GBAtemp Regular
    Member

    Joined:
    Jan 11, 2020
    Messages:
    139
    Country:
    Spain
    Updates! And for today it's enough :wacko:
     
    Zense and banjo2 like this.
  20. Robz8

    Robz8 Coolest of TWL
    Developer

    Joined:
    Oct 1, 2010
    Messages:
    14,575
    Country:
    United States
    Seems to work great!
    Any chance you can make a PR for PicoDriveTWL featuring your changes?
     
    wariobar and banjo2 like this.