Homebrew ARM9 instruction set

bayleef

Well-Known Member
OP
Newcomer
Joined
Sep 15, 2015
Messages
83
Trophies
0
XP
254
Country
Gambia, The
I want to write very fast code for the ARM9 CPU.
At the moment I am using gcc inline assembler.
However, following the BRAHMA example Makefiles, the Thumb-16 instruction set is being used.
Wikipedia tells me:
Embedded hardware, such as the Game Boy Advance, typically have a small amount of RAM accessible with a full 32-bit datapath; the majority is accessed via a 16-bit or narrower secondary datapath. In this situation, it usually makes sense to compile Thumb code and hand-optimise a few of the most CPU-intensive sections using full 32-bit ARM instructions, placing these wider instructions into the 32-bit bus accessible memory.
So, which memory regions can I use for 32-bit ARM instructions?
Can I still use inline asm, or should I create some separate asm files?
How could I tell gcc to assemble and/or link everything correctly?

I have calculated that my code would run more than 33% faster if I could use the complete ARM9 instruction set.
 

bayleef

Do you need that optimization for any purpose?
Yes. I want to play around with timing attacks against the AES hardware. I hope to read out the keyslots this way. At the moment, I cannot measure elapsed time with sufficient accuracy, because my counting loop is too slow.
 

Urbanshadow

Well-Known Member
Member
Joined
Oct 16, 2015
Messages
1,578
Trophies
0
Age
33
XP
1,723
Country
Well, ARM9 can't use Thumb-16 instructions afaik. On ARM11 you can force a switch between modes with BX. I wouldn't take it for granted that Thumb-16 is any faster than regular ARM code, but if you say so I'll believe you. You can learn more about switching instruction sets in the ARM assembler documentation online.

You should really look at the a9lh implementation, as it directly executes raw ARM9 code with the 32-bit instruction set, so it gets you exactly where you want to be. If you need the full sandbox running, I'd stay with BRAHMA.
 

bayleef

Unfortunately, I'm not very familiar with the ARM9 architecture. According to Wikipedia, ARM9 should have both instruction sets.
However, what about memory access? If only 16 bits at a time can be transferred to the CPU from some regions, will 32-bit instructions execute more slowly when located in the wrong region?
I would prefer running my code from BRAHMA II, because reading out keyslots may save people from downgrading to 2.1 to install a9lh (on N3DS).
However, since I have successfully dumped my OTP (but not installed a9lh yet), a9lh is at least an option to start with.
I thought development of a9lh payloads was very similar to BRAHMA (see this thread), the difference being that BRAHMA cannot access locked registers. Am I wrong? What are the other differences? Which sandbox do you mean? At the moment I only need user interaction (buttons, display) and access to the AES hardware.

Edit: I guess the BX, BLX instructions and the .arm, .thumb directives could do the job (not tested yet, but it can be assembled at least). However, the following question is left open: Can the code be loaded faster when put at a specific memory region (see the Wikipedia entry I cited in the first post)?
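For code that stays in C rather than in a separate asm file, GCC offers two related knobs: compiling a whole file with `-marm` instead of `-mthumb`, or, on newer GCC releases for ARM, marking a single function with the `target("arm")` attribute. Below is a minimal sketch of the second option, assuming the toolchain's GCC is recent enough to accept that attribute (not tested against the BRAHMA Makefiles). `ARM_MODE`, `count_until_set` and the `ready` flag are hypothetical names; on non-ARM hosts the macro expands to nothing so the file still builds.

```c
#include <assert.h>
#include <stdint.h>

/* On ARM targets built with GCC, request ARM (32-bit) encoding for one
 * function even if the rest of the file is compiled with -mthumb.
 * Assumption: the compiler supports the target("arm") function attribute.
 * Everywhere else the macro is empty so the code stays portable. */
#if defined(__GNUC__) && defined(__arm__) && !defined(__clang__)
#define ARM_MODE __attribute__((target("arm")))
#else
#define ARM_MODE
#endif

/* Hypothetical hot loop: count iterations until *ready becomes non-zero,
 * giving up after max_iters so the loop always terminates. */
ARM_MODE uint32_t count_until_set(const volatile uint32_t *ready,
                                  uint32_t max_iters)
{
    uint32_t count = 0;

    assert(ready != 0);
    while (*ready == 0 && count < max_iters) {
        count++;
    }
    return count;
}
```

Calls between Thumb and ARM functions then rely on the compiler and linker emitting the interworking branch (BX/BLX), which is the same mechanism the `.arm`/`.thumb` directives depend on in hand-written assembly.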
 

EmuAGR

Well-Known Member
Member
Joined
Jan 11, 2016
Messages
205
Trophies
0
Age
31
XP
246
Country
The Thumb instruction set should be faster when the memory bus is 16 bits wide, because each instruction can then be fetched in a single read instead of two. But I think the bus width of the 3DS might be 32 bits, so the complete ARM9 instruction set should be preferable.
 

Urbanshadow

EmuAGR said:
Thumb instruction set should be faster when memory bus has 16 bits, so it doesn't have to read from memory twice. But I think bus width of the 3DS might be 32 bits, so the complete ARM9 instruction set should be preferable.

The bus is indeed 32 bits, but the halfword instruction set has half the memory footprint. Two halfword instructions are fetched into the cache at once and executed sequentially, if I'm not mistaken. It makes no difference to execution time unless you hit the cache limit, afaik.

bayleef said:
I guess the BX, BLX instructions and the .arm, .thumb directives could do the job (not tested yet, but it can be assembled at least). However, the following question is left open: Can the code be loaded faster when put at a specific memory region (see the Wikipedia entry I cited in the first post)?

Check how the physical addresses in the 3DS memory map correspond to each of the system memories, and choose one.
 

WulfyStylez

SALT/Bemani Princess
Member
Joined
Nov 3, 2013
Messages
1,149
Trophies
0
XP
2,867
Country
United States
bayleef said:
Yes. I want to play around with timing attacks against the AES hardware. I hope that I could read out the keyslots this way. At the moment, I cannot measure elapsed time with sufficient accuracy, because my counting loop is too slow.
If you're running cached (which you need to be for cycle-level accuracy!), then your timing loop will run at one instruction per clock, as long as your variables are in registers and not on the stack or wherever. That holds for both ARM and Thumb. If you actually want to do anything besides idle looping, the TIMER registers will give you a count with a resolution of 1/2 the clock speed of the CPU. Also, check out the timer implementation in libnds for using two 16-bit timers as one 32-bit one, if you need that.
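The idea of using two 16-bit timers as one 32-bit counter can be sketched in plain C. This is not the libnds code, only an illustration of one common way to read such a pair safely, assuming the upper timer is configured to count overflows of the lower one. Register addresses are deliberately left out; `timer_read_fn`, `read_timer32` and the `sim_*` helpers are hypothetical names, and the simulated timer exists only so the sketch can run on a PC.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical accessor type: returns the current 16-bit count of one timer.
 * On real hardware this would be a volatile read of the counter register. */
typedef uint16_t (*timer_read_fn)(void);

/* Read two chained 16-bit timers as one 32-bit value.
 * read_lo is the fast timer, read_hi counts overflows of the fast one.
 * The high half is sampled before and after the low half; if it changed,
 * the low timer wrapped in between and the whole read is retried. */
uint32_t read_timer32(timer_read_fn read_lo, timer_read_fn read_hi)
{
    uint16_t hi, lo, hi_again;

    assert(read_lo != 0 && read_hi != 0);
    do {
        hi = read_hi();
        lo = read_lo();
        hi_again = read_hi();
    } while (hi != hi_again);

    return ((uint32_t)hi << 16) | lo;
}

/* Simulated hardware for host-side experiments: one free-running 32-bit
 * tick count that advances by one every time the high half is read. */
uint32_t sim_ticks = 0;

uint16_t sim_read_lo(void)
{
    return (uint16_t)(sim_ticks & 0xFFFFu);
}

uint16_t sim_read_hi(void)
{
    uint16_t hi = (uint16_t)(sim_ticks >> 16);
    sim_ticks++; /* time passes between register reads */
    return hi;
}
```

Starting the simulation at `0x0001FFFF` puts a wrap of the low timer between the first high read and the low read; the retry loop notices that the high half changed and reads again instead of returning the inconsistent pair.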
 

bayleef

WulfyStylez said:
If you're running cached (which you need! to be for cycle-level accuracy) then your timing loop will be one instruction per clock, so long as your vars are in registers and not on the stack or whatever. That's for ARM and THUMB. If you actually want to do anything besides idle looping, the TIMER regs will give you a count accurate to 1/2 the clock speed of the CPU. Also, check out the timer implementation in libnds for using two 16-bit timers as one 32-bit one, if you need that.
Oh, thank you for that hint! I'll definitely take a look at the TIMER registers.

My code was something like:
while (!aes_output_ready()) {  /* aes_output_ready(): placeholder for the AES status poll */
    count++;
}
which took 8 cycles per iteration in Thumb. For ARM, I used a run of ADDEQ instructions, which should come to 5 cycles per increment (ldr: 3, and: 1, add: 1).
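What the per-iteration cost means in practice is measurement granularity: a counter that advances once every N cycles can only place the moment the AES output became ready to within N cycles. Here is a rough sketch of that arithmetic in C, using only the figures mentioned in this thread (8 and 5 cycles per loop iteration, and a timer ticking at half the CPU clock, i.e. 2 cycles per tick). It is an idealised model that ignores loop entry and exit overhead, and the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Lower bound, in CPU cycles, on the elapsed time implied by a counter that
 * advanced 'count' times at 'cycles_per_step' cycles per step. */
uint32_t elapsed_min_cycles(uint32_t count, uint32_t cycles_per_step)
{
    assert(cycles_per_step > 0);
    return count * cycles_per_step;
}

/* Upper bound: the event may have happened at any point before the next
 * step would have completed, so the uncertainty is one step minus a cycle. */
uint32_t elapsed_max_cycles(uint32_t count, uint32_t cycles_per_step)
{
    return elapsed_min_cycles(count, cycles_per_step) + cycles_per_step - 1;
}

/* Number of whole counter steps that fit into a window of 'window_cycles',
 * i.e. how finely two timings inside that window can be told apart. */
uint32_t distinguishable_steps(uint32_t window_cycles, uint32_t cycles_per_step)
{
    assert(cycles_per_step > 0);
    return window_cycles / cycles_per_step;
}
```

For an arbitrary 40-cycle window, `distinguishable_steps(40, 8)` gives 5 for the Thumb loop, `distinguishable_steps(40, 5)` gives 8 for the ARM loop, and `distinguishable_steps(40, 2)` gives 20 for a timer at half the CPU clock.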
 
