ARM9 instruction set

Discussion in '3DS - Homebrew Development and Emulators' started by bayleef, Apr 13, 2016.

  1. bayleef
    OP

    bayleef Advanced Member

    Newcomer
    83
    147
    Sep 15, 2015
    Gambia, The
    I want to write very fast code for the ARM9 CPU.
    At the moment I am using gcc inline assembler.
    However, following the BRAHMA example Makefiles, the Thumb-16 instruction set is being used.
    Wikipedia tells me:
    So, which memory regions can I use for 32-bit ARM instructions?
    Can I still use inline asm, or should I create some separate asm files?
    How could I tell gcc to assemble and/or link everything correctly?

    I have calculated that my code would run more than 33% faster if I could use the complete ARM9 instruction set.
     
    Last edited by bayleef, Apr 13, 2016
  2. EmuAGR

    EmuAGR GBAtemp Regular

    Member
    198
    118
    Jan 11, 2016
    Do you need that optimization for any purpose?
     
  3. bayleef
    OP

    bayleef Advanced Member

    Newcomer
    83
    147
    Sep 15, 2015
    Gambia, The
    Yes. I want to play around with timing attacks against the AES hardware. I hope to read out the keyslots this way. At the moment, I cannot measure elapsed time with sufficient accuracy, because my counting loop is too slow.
     
    Last edited by bayleef, Apr 13, 2016
  4. Urbanshadow

    Urbanshadow GBAtemp Maniac

    Member
    1,299
    476
    Oct 16, 2015
    Well, ARM9 can't use Thumb-16 instructions afaik. On ARM11 you can force a switch between modes with BX. I wouldn't take it for granted that Thumb-16 is any faster than regular ARM code, but if you say so I'll believe you. You can learn more about switching instruction sets in the ARM assembler documentation online.

    You should really look at the a9lh implementation, as it is a direct execution of raw ARM9 code with the 32-bit instruction set, so it gets you exactly where you want. If you need the full sandbox running, I'd stay with BRAHMA.
     
    Last edited by Urbanshadow, Apr 13, 2016
  5. bayleef
    OP

    bayleef Advanced Member

    Newcomer
    83
    147
    Sep 15, 2015
    Gambia, The
    Unfortunately, I'm not very familiar with the ARM9 architecture. According to Wikipedia, ARM9 should have both instruction sets.
    However, what about memory access? If for some regions only 16 bits can be transferred to the CPU at once, will the 32-bit instruction set slow down execution when the code is located in the wrong region?
    I would prefer running my code from BRAHMA II, because reading out keyslots may save people from downgrading to 2.1 to install a9lh (on N3DS).
    However, since I have successfully dumped my OTP (but not installed a9lh yet), a9lh is at least an option to start with.
    I thought development of a9lh payloads was very similar to BRAHMA (see this thread), with the difference that BRAHMA cannot access locked registers. So, am I wrong? What are the other differences? Which sandbox do you mean? At the moment I only need user interaction (buttons, display) and access to the AES hardware.

    Edit: I guess the BX, BLX instructions and the .arm, .thumb directives could do the job (not tested yet, but it can be assembled at least). However, the following question is left open: Can the code be loaded faster when put at a specific memory region (see the Wikipedia entry I cited in the first post)?
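One way to mix the two instruction sets from C is sketched below. It assumes a GCC recent enough to accept the per-function `target("arm")` attribute; building whole files with `-marm` or `-mthumb` is the coarser alternative. The macro `ARM_MODE` and the function `busy_count` are made-up names for illustration, not part of BRAHMA or any 3DS library. With interworking available, the toolchain should emit the BX/BLX mode switches at the call boundaries, but that has not been verified on hardware here.

```c
#include <stdint.h>

/* Hypothetical helper macro: on 32-bit ARM GCC, compile the marked function
   as ARM (32-bit) code even if the rest of the file is built with -mthumb.
   On any other target the attribute is compiled out so the sketch builds. */
#if defined(__GNUC__) && defined(__arm__)
#define ARM_MODE __attribute__((target("arm"), noinline))
#else
#define ARM_MODE
#endif

/* Count loop iterations while *flag stays non-zero, giving up at limit.
   flag stands in for a hardware status register (an assumption). */
ARM_MODE uint32_t busy_count(const volatile uint32_t *flag, uint32_t limit)
{
    uint32_t count = 0;
    while (*flag != 0 && count < limit) {
        count++;
    }
    return count;
}
```

Callers compiled as Thumb can call `busy_count` like any other C function; only the body of the marked function is generated as ARM code.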
     
    Last edited by bayleef, Apr 13, 2016
  6. EmuAGR

    EmuAGR GBAtemp Regular

    Member
    198
    118
    Jan 11, 2016
    The Thumb instruction set should be faster when the memory bus is 16 bits wide, since the CPU doesn't have to read from memory twice per instruction. But I think the bus width of the 3DS might be 32 bits, so the complete ARM9 instruction set should be preferable.
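The bus-width argument can be put as a small calculation. This is a simplified sketch that ignores caches and wait states; the function name `fetches_per_instruction` is invented for illustration.

```c
#include <stdint.h>

/* Number of bus reads needed to fetch one instruction of instr_bits
   over a bus that delivers bus_bits per read (ceiling division). */
uint32_t fetches_per_instruction(uint32_t instr_bits, uint32_t bus_bits)
{
    return (instr_bits + bus_bits - 1u) / bus_bits;
}
```

On a 16-bit bus a 32-bit ARM instruction needs two reads and a Thumb instruction one; on a 32-bit bus both need a single read, so the Thumb advantage in fetch count disappears.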
     
  7. Urbanshadow

    Urbanshadow GBAtemp Maniac

    Member
    1,299
    476
    Oct 16, 2015
    The bus is indeed 32 bits, but the half word instruction set provides half the memory footprint. Two half word instructions are cached at once and executed sequentially, if I'm not mistaken. It makes no difference to execution times unless you hit the cache limit afaik.

    Check how the physical addresses in the 3DS memory space match each of the system memories and choose one.
     
    Last edited by Urbanshadow, Apr 13, 2016
  8. WulfyStylez

    WulfyStylez SALT/Bemani Princess

    Member
    1,149
    2,609
    Nov 3, 2013
    United States
    If you're running cached (which you need to be for cycle-level accuracy!) then your timing loop will be one instruction per clock, so long as your vars are in registers and not on the stack or whatever. That goes for both ARM and Thumb. If you actually want to do anything besides idle looping, the TIMER regs will give you a count accurate to 1/2 the clock speed of the CPU. Also, check out the timer implementation in libnds for using two 16-bit timers as one 32-bit one, if you need that.
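The idea of treating two 16-bit timers as one 32-bit counter could look roughly like the sketch below. It is not the libnds code: the function names are invented, the two pointers stand in for the counter registers of a cascaded timer pair (their addresses are deliberately left out), and the factor of two in `ticks_to_cpu_cycles` simply restates the "half the CPU clock" figure from the post, assuming no extra prescaler.

```c
#include <stdint.h>

/* Read two cascaded 16-bit counters as one 32-bit value.
   lo counts ticks; hi is assumed to increment whenever lo overflows. */
uint32_t read_timer32(const volatile uint16_t *lo, const volatile uint16_t *hi)
{
    uint16_t h1, l, h2;
    do {
        h1 = *hi;
        l  = *lo;
        h2 = *hi;
    } while (h1 != h2);  /* retry if lo wrapped between the two hi reads */
    return ((uint32_t)h1 << 16) | l;
}

/* Elapsed ticks between two readings; unsigned subtraction stays correct
   across a single wrap of the 32-bit counter. */
uint32_t ticks_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert timer ticks to CPU cycles, assuming one tick per two cycles. */
uint64_t ticks_to_cpu_cycles(uint32_t ticks)
{
    return (uint64_t)ticks * 2u;
}
```

Reading the high half twice guards against the low half overflowing in the middle of a read, which would otherwise produce a value off by 65536 ticks.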
     
  9. bayleef
    OP

    bayleef Advanced Member

    Newcomer
    83
    147
    Sep 15, 2015
    Gambia, The
    Oh, thank you for that hint! I'll definitely take a look at the TIMER registers.

    My code was something like
    while(!AES output ready) {
    count++;
    }
    which took 8 cycles for Thumb. For Arm, I have used a bunch of SUMEQ commands, which should be 5 cycles per increment (ldr: 3, and: 1, add: 1).
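To show how the per-iteration cost limits the measurement, a loop count can be converted into an approximate cycle count as sketched here. The function name `loop_cycles` is made up, and the 8 and 5 cycle figures are the ones quoted in this post, not independently measured.

```c
#include <stdint.h>

/* Approximate cycles spent in a counting loop. The event being timed
   can occur anywhere inside the last iteration, so the result is only
   accurate to within cycles_per_iteration. */
uint64_t loop_cycles(uint32_t count, uint32_t cycles_per_iteration)
{
    return (uint64_t)count * cycles_per_iteration;
}
```

For a count of 1000, this gives 8000 cycles at 8 cycles per iteration and 5000 at 5; a loop that runs at one cycle per iteration, or a hardware timer, narrows the uncertainty accordingly.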
     
    Last edited by bayleef, Apr 13, 2016