Hacking Hardware Picofly - a HWFLY switch modchip

binkinator

View attachment 360114
View attachment 360115

Just playing around with some donor boards. What solution should I use?

Yeah, that's true, bulk is the way to go...
Still gonna try and create a PCB, then I'll check on it again and see how much bulk is, and maybe just buy bulk and offer them for picofly. We will see.

Never used PCBWay, only JLCPCB, and never had issues with them.

I like this one:

1679275745152.jpeg

It has that “Pamela Anderson in Barbed Wire” vibe to it.

Updoot if you want this to be @JuanBaNaNa’s April Avatar.

1679276359734.jpeg
 

impeeza

I'm pleased to announce that after significant time working on a cycle-accurate (this is very important) emulator, I've finally been able to get past the decryption phase and have dumped the segments of ARM code that are written by the end of the encryption. I have to say, this is the most fun CTF I've ever done, although at some point it ran out of steam and couldn't surprise me that much.

My dump isn't perfect--the code it jumps to itself is a tiny bit obfuscated (as in, it copies code to other locations and jumps there to fool IDA's autoanalysis) but as far as I'm concerned the hard part is over. I even have the PIO code(!!!) it writes and executes on the PIO1 state machines.

This is largely a follow-up of my previous post, and I don't want to duplicate information, so if you're confused I'd recommend reading that one first.

Also, big thanks to the people who sent me their firmwares; without them, none of this would have been possible. If you did send me a firmware and would like others to be able to look at what gets dumped, please tell me so I can do that. I can also release my emulator if it helps someone.. it's just 2000 lines of hastily written Zig code, although you will have to manually find some patch addresses so that it works properly.

I'll split this up into sections to avoid spam.

First, and least important, the mysterious SWD message it sends at the very start just wakes it from a dormant state, so not very interesting.

Then there is the decryption. After initializing some data structures (I called them "wordbank0", "wordbank1", and "constant_random_data_waste_of_time") it sets the VTOR to an initial value (e.g. 0xEE2F8D10, which gets truncated to 0xEE2F8D00) and then goes 16 bytes at a time (we'll call this a block) over the binary blob at the base of the SRAM. It also takes the 8-byte board ID and copies it to a 256-byte structure which I call the flash XOR buffer.
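For anyone emulating this, the VTOR truncation mentioned above can be sketched in a couple of lines. This is only an illustration: the mask is an assumption chosen to match the 0xEE2F8D10 -> 0xEE2F8D00 example (low 8 bits read back as zero), and `write_vtor` is a made-up helper name, not something from the firmware.

```python
# Assumption: the low 8 bits of VTOR always read back as zero, which matches
# the 0xEE2F8D10 -> 0xEE2F8D00 example in the post.
VTOR_MASK = 0xFFFFFF00


def write_vtor(value: int) -> int:
    """Return the value an emulated VTOR would hold after a 32-bit write."""
    return value & 0xFFFFFFFF & VTOR_MASK


if __name__ == "__main__":
    print(hex(write_vtor(0xEE2F8D10)))
```

Since the key is derived from the VTOR, an emulator that stores the untruncated value would presumably derive a different key.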

In a block, for each byte, it will first derive a key based on the value stored in the VTOR as well as the flash XOR buffer and a rolling index into it. This value, along with the previous value of the process stack pointer (PSP/SP_Process/whatever ARM calls it), is then written to the current PSP.

This means that some part of the encryption relies on the *UNINITIALIZED* value of the PSP, which is 0xFFFFFFFC, in case you were having trouble with your emulation.

The key is manipulated some more after that. Interestingly, it then sets the flash XOR buffer at the selected index to the value of the current encrypted byte. Finally, it XORs the encrypted byte with the key and writes the result.

At the end of a block, after all 16 bytes have been written, it takes the PSP shifted right by 8 bits and XORs it with the decrypted byte at index 15 (that is, the last decrypted byte in the block). Based on whether bits 0 through n of that value are set (n seems to vary between firmwares), it calls a function we'll name readWriteOrCall in a loop of up to n iterations (the same n!). That function is covered below.

Finally, it writes a new PSP. All of this can be seen in publicly available firmwares (in this thread) so I don't want to bother going into specifics; it's not that complicated.
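The shape of the per-block loop described above can be sketched roughly like this. It is a structural sketch only: the real key derivation and PSP mixing are not spelled out in the post, so `derive_key` and `mix_psp` are hypothetical callbacks supplied by the caller, the "key is manipulated some more" step is left out, and the way the rolling index advances is an assumption.

```python
from typing import Callable, Dict

PSP_RESET = 0xFFFFFFFC  # PSP value the post reports before anything writes it


def decrypt_block(
    block: bytes,
    xor_buf: bytearray,
    state: Dict[str, int],
    derive_key: Callable[[int, bytearray, int], int],  # hypothetical stand-in
    mix_psp: Callable[[int, int], int],                # hypothetical stand-in
    n_bits: int,
    read_write_or_call: Callable[[int], None],
) -> bytes:
    """Decrypt one 16-byte block and fire the end-of-block callbacks."""
    out = bytearray()
    for enc in block:
        idx = state["index"] % len(xor_buf)           # rolling index (assumed +1 per byte)
        key = derive_key(state["vtor"], xor_buf, idx) & 0xFF
        state["psp"] = mix_psp(key, state["psp"]) & 0xFFFFFFFF
        xor_buf[idx] = enc                            # buffer is fed the *encrypted* byte
        out.append(enc ^ key)                         # plaintext = ciphertext XOR key
        state["index"] += 1
    # End of block: (PSP >> 8) XOR last decrypted byte selects the callbacks.
    selector = (state["psp"] >> 8) ^ out[-1]
    for bit in range(n_bits):
        read_write_or_call((selector >> bit) & 1)
    return bytes(out)
```

The point of writing it this way is that the state (`vtor`, `psp`, `index`, `xor_buf`) carries over between blocks, so a single wrong value early on corrupts everything after it.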

We then have readWriteOrCall, which, based on the input (the value of bit n from (PSP >> 8) ^ last_encrypted_byte ^ key) manipulates the previously mentioned word banks (wordbank0/wordbank1) with some division, multiplication, shifts, etc (it sounds complicated but it's simple enough to just F5 and replicate in IDA) until it eventually maybe decides to call the most important function, which I just called executeRWC.

executeRWC is very funny because in the middle of the control-flow graph there is an innocuous branch that loads a value into R0 and then jumps to it. For a while I thought this was where it jumps inside the encrypted blob, but that is wrong. In fact, it never takes that branch. Go figure! Like I said, this was the most interesting CTF I've ever done.

executeRWC is also very important. It has two other features that are actually used: it can read or write arbitrary memory via SWD sequences sent to a core. The action taken, the addresses and data used, and which processor to do it on (this is also important) are vaguely derived from the value given to it, which is derived from how the word banks are manipulated, which is derived from the bit setting of the decrypted byte and the PSP, which is certainly a mouthful.

As expected, the writes are mostly used for anti-debug. After initializing the systick (!!!) it spams writes to 0x4001C080 with the value 0x80 -- this is the pad control output-disable bit for the external SWD pin, which is why it's impossible to debug the RP2040 while it's decrypting the blob. They also periodically read from this register to make sure it's still 0x80.

There are other things it does with the SWD writes, but that's also for later.

SWD reads will read the value and then manipulate the VTOR. Yes, the very same VTOR that is used to derive the encryption key and modify the PSP... meaning that it's essentially a check to see if a memory address contains an expected value.

SWD writes are always done by processor 0, but SWD reads can alternate between 0 and 1 (where 0 is the core executing this code and 1 is the secondary core). However, they are only done by processor 1 when reading 0xE000101C. What is that peripheral, one might ask? It sits in the Cortex-M0+ private peripheral space, but the RP2040 datasheet does not document it. It turns out, of course, that the ARMv6-M ARM documents it as DWT_PCSR, a register holding the address of a "recently executed" (they deliberately don't define this) instruction.

Because the read is done by core 1, it means they are essentially checking if core 1 is halting at a WFE instruction in the RP2040's bootrom. In other words, they're checking to see if there is any code running on the other core--if you were trying to run e.g. debugging routines on that core but the decryption was failing, this is why. They do this read fairly often, and the value they expect is either 0x180 or 0x174, depending on the bootrom version.
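In an emulator, this check can be satisfied with a small read hook along these lines. This is a sketch under assumptions: `emulated_swd_read` and its `memory` dict are hypothetical names, and which of the two expected values belongs to which bootrom version is not stated above, so the caller has to pick one.

```python
DWT_PCSR = 0xE000101C            # the address read via core 1
BOOTROM_WFE_PCS = (0x180, 0x174)  # values the firmware expects, per bootrom version


def emulated_swd_read(core: int, addr: int, memory: dict, idle_pc: int = 0x180) -> int:
    """Return the word an SWD read of `addr` on `core` should see.

    Core 1 is modelled as permanently parked in the bootrom, so a read of
    the PC sample register always yields the chosen idle address.
    """
    if core == 1 and addr == DWT_PCSR:
        if idle_pc not in BOOTROM_WFE_PCS:
            raise ValueError("idle_pc is not one of the values the firmware expects")
        return idle_pc
    # Everything else falls through to the emulator's own memory model.
    return memory.get(addr, 0)
```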

Beyond this they also mostly read the VTOR (self-explanatory) and other SWD comparator registers (which must return 0xFFFFFFFF)... and the systick.

Normally, emulating the systick would be easy, but because it's being read and written by the SWD protocol, we need to keep in mind not only which SWD bit write causes the memory operation (it's the first read done by the 16 bit "turnaround" right after SWCLK is forced high) but also *when* the protocol is permitted to access memory. This is because it must go through the processor core, which can only read/write one address at a time.

After much pain and experimentation I found that both reads and writes dispatch exactly 4 cycles after the instruction that forces SWCLK high.. unless the processor is accessing memory. This means that if an instruction is aligned to 4 bytes, or accesses memory a bunch of times (like POP or LDM), or is 4 bytes wide (like MSR/MRS/BL), the SWD operation will be delayed.

As a further example, if an instruction is NOT aligned to 4 bytes and accesses an AHB-lite address, which normally takes 2 cycles, the first cycle will be used to perform the access, and the processor stalls on the second cycle, where the SWD operation can take place. However, if the instruction IS aligned to 4 bytes, the first cycle is spent fetching the word it sits on and the next cycle is used to perform the access, so there is no room and it has to wait for the next instruction. The actual logic is more involved (like with how it interacts with 4-byte instructions and instructions that reference SIO memory) but the bottom line is that getting something to work accurately is not impossible.

Once all of that is implemented, and your emulator properly counts cycles, it's almost smooth sailing from then on. It will keep reading and writing the peripherals mentioned above until a certain point.
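The three conditions listed above can be written down as a simple predicate. This is a deliberately simplified sketch: the function name and parameters are made up here, "a bunch of times" is read as "more than one access" (an assumption), and the SIO and other special cases mentioned above are not modelled.

```python
SWD_DISPATCH_DELAY = 4  # cycles after the instruction that forces SWCLK high


def swd_access_delayed(instr_addr: int, instr_width: int, mem_accesses: int) -> bool:
    """Rough check for whether the pending SWD access must wait.

    instr_addr   -- address of the instruction executing at the dispatch cycle
    instr_width  -- 2 for ordinary Thumb instructions, 4 for MSR/MRS/BL
    mem_accesses -- number of data accesses it performs (POP/LDM do several)
    """
    word_aligned = instr_addr % 4 == 0  # fetch occupies the bus this cycle
    wide = instr_width == 4
    multi_access = mem_accesses > 1     # assumption: "a bunch" means more than one
    return word_aligned or wide or multi_access
```

For example, a 2-byte instruction at an address ending in 2 with a single load would not delay the access, while the same instruction at a word-aligned address would.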

Now for the fun part: I imagine the author of the firmware realized that patching the board ID in code was too trivial. To mitigate this, of course, they just.. read it with SWD operations. This requires them to start using PLL_REF, for some reason, (I noticed while replicating this on my Pico that if I didn't use CLK_REF it would freeze.. not sure why) but after that they do the standard reads/writes to 0x18000060 with the message 0x4B.

An emulator can simply watch for these reads and writes and return the flash ID, but anyone running this on an unintended system will run into trouble.. you could patch the SWD read/write routines and restore the systick each time, though.
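A minimal sketch of the "just return the flash ID" approach for an emulator might look like this. Assumptions: the flash answers command 0x4B the way common SPI NOR parts do (4 dummy bytes followed by the 8 ID bytes), every write to the data register shifts one byte out and one byte in, and chip-select handling is ignored. `FakeFlashSsi` and its method names are invented for the example.

```python
from collections import deque

SSI_DR0 = 0x18000060       # data register the firmware reads and writes
CMD_READ_UNIQUE_ID = 0x4B  # the "message" mentioned in the post


class FakeFlashSsi:
    """Answers a Read Unique ID sequence with a fixed 8-byte ID."""

    def __init__(self, unique_id: bytes) -> None:
        if len(unique_id) != 8:
            raise ValueError("unique_id must be exactly 8 bytes")
        self.unique_id = unique_id
        self.rx = deque()  # bytes waiting to be read back from DR0
        self.pos = None    # bytes seen since the command, None when idle

    def write_dr0(self, value: int) -> None:
        byte = value & 0xFF
        if self.pos is None:
            if byte == CMD_READ_UNIQUE_ID:
                self.pos = 0
            self.rx.append(0)  # nothing meaningful comes back for the command byte
            return
        # Assumed layout: 4 dummy bytes, then the 8 ID bytes.
        self.rx.append(0 if self.pos < 4 else self.unique_id[self.pos - 4])
        self.pos += 1
        if self.pos == 12:
            self.pos = None

    def read_dr0(self) -> int:
        return self.rx.popleft() if self.rx else 0
```

An emulator's memory hook would route accesses to `SSI_DR0` to `write_dr0` and `read_dr0`; the timing side (the cycle delay the flash accesses add) still has to be handled separately.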

During this, I assume the systick is somewhat unreliable, so they swap a conspicuous global variable pointer for the address of the XOSC COUNT register (which just happens to now run at the same frequency as the SYS clk because it uses the REF PLL). The loop accesses it in a sequence like:

Code:
LDR Rx, [0x.......] ; gets changed from a random byte to the XOSC COUNT address
...
STRB Rn, [Rx]       ; in the loop, it stores a part of the VTOR to that address;
                    ; this is the first access each loop, which explains why they did this
...
LDRB Ry, [Rx]       ; first read. if a normal byte address, will be the same value.
                    ; if the XOSC COUNT register, will be decremented a bit
...
LDRB Ry, [Rx]       ; ditto

meaning that in between reads you have to emulate cycle differences. I found that from the write instruction an offset like +3 worked, although at that point I had already counted a 4 cycle delay from writing to that peripheral. Otherwise, it's the same clk counting idea as the systick, but a lot smaller and easier to see if you're doing it right via doing the same on actual hardware.
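As a rough illustration of what the emulator has to do for that register, here is a sketch of a down counter that is loaded by a write and decays with elapsed cycles. Assumptions: the counter is 8 bits wide, ticks once per emulated cycle during this phase, and stops at zero; `offset` is a tuning knob standing in for the "+3"-style correction, and its sign convention here is a guess.

```python
XOSC_COUNT = 0x4002401C  # XOSC base 0x40024000 + COUNT offset 0x1C


class XoscCount:
    """Down counter: a write loads it, reads see it minus elapsed cycles."""

    def __init__(self, offset: int = 0) -> None:
        self.offset = offset  # hypothetical correction, tune against real hardware
        self.value = 0
        self.written_at = 0

    def write(self, value: int, cycle: int) -> None:
        self.value = value & 0xFF
        self.written_at = cycle

    def read(self, cycle: int) -> int:
        elapsed = max(0, cycle - self.written_at + self.offset)
        return max(0, self.value - elapsed)  # counts down to zero and stops
```

With this model, `XoscCount().write(0x80, 100)` followed by a read at cycle 110 gives 118, and the two LDRB reads in the loop return different values exactly when the cycle count between them is non-zero.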

After verifying the board ID, they reset the sys clk back to what it was and stop using the XOSC COUNT register in the loop.

Doing more of the same typical reads/writes, they eventually read the systick again, which initially threw me off because my value was wrong. Doing exactly the same thing on my Pico revealed that in total the flash accesses add a delay of 52 cycles, which also worked here, thankfully.

Finally, using the SWD interface they will write and execute PIO(!!) programs. The following only omits the typical reads and writes as well as specific writes to SRAM.

-GPIO16 funcsel <= NULL
-writes to PIO0 instr mem starting at 8:

0x6030
0xC010
0x20A0
0x2020
0x6081
0x004A
0xC050
0x6060
0x6020
0x2041
0x00D2
0x203E
0x20BE
0x203E
0x20BE
0x4001
0x0055
0x6020
0x2040
0x00DB
0x203F
0x20BF
0x4001
0x005C

-then, does a bunch of normal reads/writes with varying patterns

-replaces globalvar pointers in executeRWC with SIO INTERP0_BASE0, INTERP0_BASE1, INTERP0_BASE2
-they don't read any result values from this, so they act like normal memory locations
-reads the value of the registers (with SWD) to make sure they're being manipulated
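To make the hex listings easier to read, here is a small sketch of a PIO instruction decoder following the RP2040 encoding (3-bit opcode, 5-bit delay/side-set field, 8 operand bits). It is not taken from the firmware or from any existing tool; the mnemonic formatting is my own, and the delay/side-set field is printed raw because splitting it depends on the state machine's side-set configuration.

```python
JMP_COND = ["", "!x", "x--", "!y", "y--", "x!=y", "pin", "!osre"]
WAIT_SRC = ["gpio", "pin", "irq", "reserved"]
IN_SRC = ["pins", "x", "y", "null", "reserved", "reserved", "isr", "osr"]
OUT_DST = ["pins", "x", "y", "null", "pindirs", "pc", "isr", "exec"]
MOV_DST = ["pins", "x", "y", "reserved", "exec", "pc", "isr", "osr"]
MOV_SRC = ["pins", "x", "y", "null", "reserved", "status", "isr", "osr"]
MOV_OP = ["", "!", "::", "reserved "]
SET_DST = ["pins", "x", "y", "reserved", "pindirs", "reserved", "reserved", "reserved"]


def decode_pio(word: int) -> str:
    """Decode one 16-bit PIO instruction into a readable string."""
    op = (word >> 13) & 7
    field = (word >> 8) & 0x1F  # delay/side-set bits, left undivided
    hi = (word >> 5) & 7        # operand bits 7:5
    lo = word & 0x1F            # operand bits 4:0
    if op == 0:
        cond = JMP_COND[hi]
        text = "jmp " + (cond + ", " if cond else "") + str(lo)
    elif op == 1:
        text = f"wait {(word >> 7) & 1} {WAIT_SRC[(word >> 5) & 3]} {lo}"
    elif op == 2:
        text = f"in {IN_SRC[hi]}, {lo or 32}"   # a count of 0 encodes 32
    elif op == 3:
        text = f"out {OUT_DST[hi]}, {lo or 32}"
    elif op == 4:
        name, flag = ("pull", " ifempty") if word & 0x80 else ("push", " iffull")
        text = name + (flag if word & 0x40 else "")
        text += " block" if word & 0x20 else " noblock"
    elif op == 5:
        text = f"mov {MOV_DST[hi]}, {MOV_OP[(word >> 3) & 3]}{MOV_SRC[word & 7]}"
    elif op == 6:
        mode = "clear" if word & 0x40 else ("wait" if word & 0x20 else "set")
        text = f"irq {mode} {lo}"
    else:
        text = f"set {SET_DST[hi]}, {lo}"
    if field:
        text += f" [delay/side={field}]"
    return text


if __name__ == "__main__":
    for w in (0x6030, 0xC010, 0x20A0, 0x2020, 0x6081, 0x004A):
        print(f"0x{w:04X}  {decode_pio(w)}")
```

Running it over the first few words of the PIO0 listing gives `out x, 16`, `irq set 16`, `wait 1 pin 0`, `wait 0 pin 0`, `out pindirs, 1` and `jmp x--, 10`.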

it then starts writing PIO programs into memory, and executing PIO instructions directly:

RESETS WDSEL <= 0xC00 (bits 10, 11); PIO1, PIO0 (clear bits)
PSM WDSEL <= 0x4000 (bit 14; SIO reset)

it then writes a PIO program to PIO1, this time filling all 32 instruction slots (0x50300048 is INSTR_MEM0, 0x503000C4 is INSTR_MEM31):

swdwrite 0x50300048 <= 0xE021
swdwrite 0x5030004C <= 0xC023
swdwrite 0x50300050 <= 0xA047
swdwrite 0x50300054 <= 0xC3
swdwrite 0x50300058 <= 0x2020
swdwrite 0x5030005C <= 0x20A0
swdwrite 0x50300060 <= 0x84
swdwrite 0x50300064 <= 0x42
swdwrite 0x50300068 <= 0xC040
swdwrite 0x5030006C <= 0x6020
swdwrite 0x50300070 <= 0x6040
swdwrite 0x50300074 <= 0xC023
swdwrite 0x50300078 <= 0xC020
swdwrite 0x5030007C <= 0x4D
swdwrite 0x50300080 <= 0x108E
swdwrite 0x50300084 <= 0xC021
swdwrite 0x50300088 <= 0xA027
swdwrite 0x5030008C <= 0xD1
swdwrite 0x50300090 <= 0x2020
swdwrite 0x50300094 <= 0x20A0
swdwrite 0x50300098 <= 0x52
swdwrite 0x5030009C <= 0xC043
swdwrite 0x503000A0 <= 0xA027
swdwrite 0x503000A4 <= 0xD7
swdwrite 0x503000A8 <= 0x203F
swdwrite 0x503000AC <= 0x20BF
swdwrite 0x503000B0 <= 0x4001
swdwrite 0x503000B4 <= 0x58
swdwrite 0x503000B8 <= 0xA026
swdwrite 0x503000BC <= 0x8020
swdwrite 0x503000C0 <= 0xB6
swdwrite 0x503000C4 <= 0xC041

PIO1 SM0_EXECCTRL <= 0x1C01FB00
PIO1 SM0_SHIFTCTRL <= 0x80010000
PIO1 SM0_PINCTRL <= 0xE0000

then executes the following instructions on PIO1 using the SM0_INSTR register:
swdwrite 0x503000D8 <= 0x16
swdwrite 0x503000D8 <= 0xE042
swdwrite 0x503000D8 <= 0x4044
swdwrite 0x503000D8 <= 0xE04F
swdwrite 0x503000D8 <= 0x4044
swdwrite 0x503000D8 <= 0xA0E6
swdwrite 0x503000D8 <= 0xA0C3
swdwrite 0x503000D8 <= 0xE041
swdwrite 0x503000D8 <= 0x4044
swdwrite 0x503000D8 <= 0xE043
swdwrite 0x503000D8 <= 0x4044
swdwrite 0x503000D8 <= 0xE045
swdwrite 0x503000D8 <= 0x4044
swdwrite 0x503000D8 <= 0xE041
swdwrite 0x503000D8 <= 0x4044
swdwrite 0x503000D8 <= 0xA046

it then enables PIO1 SM0 (sets bit 1 of reg at 0x000).


PIO1 SM2_EXECCTRL <= 0x1D015780
PIO1 SM2_SHIFTCTRL <= 0x90000
PIO1 SM2_PINCTRL <= 0xD8000

it then executes the following instructions on PIO1 using the SM2_INSTR register:
swdwrite 0x50300108 <= 0xE042
swdwrite 0x50300108 <= 0x4044
swdwrite 0x50300108 <= 0xE041
swdwrite 0x50300108 <= 0x4044
swdwrite 0x50300108 <= 0xE040
swdwrite 0x50300108 <= 0x4044
swdwrite 0x50300108 <= 0xA0E6


PIO1 SM1_PINCTRL <= 0x40001E0
it then executes an instruction on PIO1 using the SM1_INSTR register:
swdwrite 0x503000F0 <= 0xE09F

PIO1 SM1_EXECCTRL <= 0xE480
PIO1 SM1_SHIFTCTRL <= 0x60000
PIO1 SM1_PINCTRL <= 0x20003C00

PIO1 SM3_EXECCTRL <= 0x1C008000
PIO1 SM3_SHIFTCTRL <= 0x90000
PIO1 SM3_PINCTRL <= 0xD8000

it then executes the following instructions on PIO1 using the SM3_INSTR register:
swdwrite 0x50300120 <= 0xE042
swdwrite 0x50300120 <= 0x4044
swdwrite 0x50300120 <= 0xE04F
swdwrite 0x50300120 <= 0x4044
swdwrite 0x50300120 <= 0xA0E6
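When going through the swdwrite lines it helps to turn the raw addresses back into register names. The following is a small sketch based on the RP2040 PIO register layout (instruction memory from offset 0x048, per-state-machine registers from 0x0C8 in steps of 0x18); `pio1_reg_name` is a made-up helper and only covers the registers that show up in this post.

```python
PIO1_BASE = 0x50300000
INSTR_MEM0 = 0x048  # first of 32 instruction memory words
SM0_CLKDIV = 0x0C8  # first per-state-machine register
SM_STRIDE = 0x18    # six 32-bit registers per state machine
SM_REGS = ["CLKDIV", "EXECCTRL", "SHIFTCTRL", "ADDR", "INSTR", "PINCTRL"]


def pio1_reg_name(addr: int) -> str:
    """Map a PIO1 bus address to a register name, or a raw offset if unknown."""
    off = addr - PIO1_BASE
    if off == 0:
        return "CTRL"
    if INSTR_MEM0 <= off < SM0_CLKDIV:
        return f"INSTR_MEM{(off - INSTR_MEM0) // 4}"
    if SM0_CLKDIV <= off < SM0_CLKDIV + 4 * SM_STRIDE:
        sm, reg = divmod(off - SM0_CLKDIV, SM_STRIDE)
        return f"SM{sm}_{SM_REGS[reg // 4]}"
    return f"+0x{off:03X}"


if __name__ == "__main__":
    for a in (0x503000D8, 0x503000F0, 0x50300108, 0x50300120):
        print(hex(a), pio1_reg_name(a))
```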

Finally, along with some other innocuous SRAM writes, it jumps to the binary blob by overwriting a return address on the stack, ending the encryption algorithm.

I hope this information was helpful to someone, and if you want to know more please do not hesitate to DM me. I've spent a long time working on this and I would be very glad knowing it helped someone.
Man, great work! Thanks a lot for your time and effort.
I do love the twisted wire one; it is very nice.
 

hippy dave

Mad skills 😍
 

marhalloweenvt

They can’t even purple silkscreen. They’re useless. /s

Honestly though, at $10/square inch, that’s kind of high.
I used them just for some small PCB designs like a GBA filter, GBA/GBC speakers, a GBA SP replacement switch, and so on, and the price is acceptable.
1679279916962.png

I don't use them for purple silkscreen (because where I live, I can order from China with more color choices, and the price is affordable for a 2-layer PCB design, but with a flex one, it's a different story).
 

rehius

I'm pleased to announce that after significant time working on a cycle-accurate (this is very important) emulator I've finally been able to go past the decryption phase and have dumped the segments of ARM code that is written by the end of the encryption. I have to say, this is the most fun CTF I've ever done, although at some point it ran out of steam and couldn't surprise me that much.

My dump isn't perfect--the code it jumps to itself is a tiny bit obfuscated (as in, it copies code to other locations and jumps there to fool IDA's autoanalysis) but as far as I'm concerned the hard part is over. I even have the PIO code(!!!) it writes and executes on the PIO1 state machines.

This is largely a follow-up of my previous post, and I don't want to duplicate information, so if you're confused I'd recommend reading that one first.

Also, big thanks to the people who sent me their firmwares; without them, none of this would have been possible. If you did send me a firmware and would like others to be able to look at what gets dumped, please tell me so I can do that. I can also release my emulator if it helps someone.. it's just 2000 lines of hastily written Zig code, although you will have to manually find some patch addresses so that it works properly.

I'll split this up into sections to avoid spam.

First, and most unimportant, the mysterious SWD message it sends at the very start just wakes it from a dormant state, so not very interesting.

Then there is the decryption. After initializing some data structures (I called them "wordbank0", "wordbank1", and "constant_random_data_waste_of_time") it sets the VTOR to an initial value (e.g. EE2F8D10, which gets truncated to EE2F8D00) and then goes 16 bytes at a time (we'll call this a block) on the binary blob at the base of the SRAM. It also takes the 8-byte board ID and copies it to a 256 byte structure which I call the flash XOR buffer.

In a block, for each byte, it will first derive a key based on the value stored in the VTOR as well as the flash XOR buffer and a rolling index into it. This value, along with the previous value of the process stack pointer (PSP/SP_Process/whatever ARM calls it), is then written to the current PSP.

This means that some part of the encryption relies on the *UNINITIALIZED* value of the PSP, which is 0xFFFFFFFC, in case you were having troubles with your emulation.

The key is manipulated some more after that. Interestingly, it then sets the flash XOR buffer at the selected index to the value of the current encrypted byte. Finally, it XORs the encrypted byte with the key and writes it.

At the end of a block, after all 16 bytes have been written, it takes the PSP shifted right by 8 bits and XORs it with the last decrypted byte in the block (byte 15). Based on whether bits 0 through n of that result are set (n seems to vary between firmwares), it calls a function we'll call readWriteOrCall in a loop of up to n iterations (the same n!). This will be looked at later.

Finally, it writes a new PSP. All of this can be seen in publicly available firmwares (in this thread) so I don't want to bother going into specifics; it's not that complicated.
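
The per-block flow described above can be sketched as follows. This is only a structural outline under loud assumptions: the real key derivation is not given here, so `derive_key` is a hypothetical callback, `toy_key` is a stand-in that is NOT the firmware's derivation, and the "index + 1 modulo buffer length" update for the rolling index is a guess. What the sketch does show is the ordering: derive the key, store the encrypted byte into the flash XOR buffer, XOR, and finally compute (PSP >> 8) XOR the last decrypted byte.

```python
from typing import Callable, List, Tuple

BLOCK_SIZE = 16
UNINITIALIZED_PSP = 0xFFFFFFFC  # PSP value before anything has written it

# Hypothetical signature: (vtor, xor_buf, index, psp) -> (key byte, new psp)
KeyFn = Callable[[int, List[int], int, int], Tuple[int, int]]


def toy_key(vtor: int, xor_buf: List[int], index: int, psp: int) -> Tuple[int, int]:
    """Stand-in key derivation for demonstration only."""
    key = ((vtor >> 8) ^ xor_buf[index] ^ psp) & 0xFF
    return key, (psp + key) & 0xFFFFFFFF


def decrypt_block(block: bytes, xor_buf: List[int], index: int,
                  vtor: int, psp: int, derive_key: KeyFn):
    """Decrypt one block; returns (plaintext, index, psp, selector)."""
    plain = bytearray()
    for enc in block:
        key, psp = derive_key(vtor, xor_buf, index, psp)
        # The buffer is fed with the *encrypted* byte, so later keys
        # depend on earlier ciphertext.
        xor_buf[index] = enc
        plain.append(enc ^ (key & 0xFF))
        index = (index + 1) % len(xor_buf)  # assumed rolling-index update
    # End of block: this value decides the readWriteOrCall invocations.
    selector = ((psp >> 8) ^ plain[-1]) & 0xFFFFFFFF
    return bytes(plain), index, psp, selector


buf = [0] * 256
out, idx, psp, sel = decrypt_block(bytes(range(BLOCK_SIZE)), buf, 0,
                                   0xEE2F8D00, UNINITIALIZED_PSP, toy_key)
```

Because the buffer is updated with ciphertext rather than plaintext, a wrong key early on does not corrupt the buffer itself, only the PSP chain.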

We then have readWriteOrCall, which, based on the input (the value of bit n from (PSP >> 8) ^ last_encrypted_byte ^ key) manipulates the previously mentioned word banks (wordbank0/wordbank1) with some division, multiplication, shifts, etc (it sounds complicated but it's simple enough to just F5 and replicate in IDA) until it eventually maybe decides to call the most important function, which I just called executeRWC.

executeRWC is very funny because in the middle of the control-flow graph there is an innocuous branch that loads a value into R0 and then jumps to it. For a while I thought this was where it jumps inside the encrypted blob, but that is wrong. In fact, it never takes that branch. Go figure! Like I said, this was the most interesting CTF I've ever done.

executeRWC is also very important. It has two other (used) features: it can read or write arbitrary memory via SWD sequences sent to a core. The action taken, the addresses and data used, and which processor to do it on (this is also important) are vaguely derived from the value given to it, which is derived from how the word banks are manipulated, which is derived from the bit setting of the decrypted byte and the PSP, which is certainly a mouthful.

As expected, the writes are mostly used for anti-debug. After initializing the systick (!!!) it spams writes to 0x4001C080 with the value 0x80 -- this is the pad ctrl disable bit for the external SWD pin, which is why it's impossible to debug the RP2040 while it's decrypting the blob. They also periodically read from this register to make sure it's still 0x80.
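
To see what the 0x80 value does, the pad control word can be split into its fields. This is a small sketch using the PADS_BANK0 field layout as I understand it from the RP2040 datasheet (worth double-checking there); `decode_pad_ctrl` is a made-up helper name.

```python
PADS_BANK0_SWD = 0x4001C080  # address the loader keeps writing 0x80 to


def decode_pad_ctrl(value: int) -> dict:
    """Split a PADS_BANK0 pad control word into its fields."""
    return {
        "output_disable": bool(value & 0x80),  # OD, bit 7
        "input_enable": bool(value & 0x40),    # IE, bit 6
        "drive": (value >> 4) & 0x3,           # DRIVE, bits 5:4
        "pull_up": bool(value & 0x08),         # PUE, bit 3
        "pull_down": bool(value & 0x04),       # PDE, bit 2
        "schmitt": bool(value & 0x02),         # SCHMITT, bit 1
        "slew_fast": bool(value & 0x01),       # SLEWFAST, bit 0
    }


fields = decode_pad_ctrl(0x80)
print(fields["output_disable"], fields["input_enable"])  # True False
```

With output disabled and input not enabled, the external pin is cut off in both directions, which fits the behaviour described above.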

There are other things it does with the SWD writes, but that's also for later.

SWD reads will read the value and then manipulate the VTOR. Yes, the very same VTOR that is used to derive the encryption key and modify the PSP... meaning that it's essentially a check to see if a memory address contains an expected value.

SWD writes are always done by processor 0, but SWD reads can alternate between 0 and 1 (where 0 is the core executing this code and 1 is the secondary core). However, they are only done by processor 1 when reading 0xE000101C. What is that peripheral, one might ask? It's on the SCS page, but the RP2040 datasheet does not document it. It turns out, of course, that the ARMv6-M ARM documents it as a register holding the address of a "recently executed" (they deliberately don't define this) instruction.

Because the read is done by core 1, it means they are essentially checking if core 1 is halting at a WFE instruction in the RP2040's bootrom. In other words, they're checking to see if there is any code running on the other core--if you were trying to run e.g. debugging routines on that core but the decryption was failing, this is why. They do this read fairly often, and the value they expect is either 0x180 or 0x174, depending on the bootrom version.

Beyond this they also mostly read the VTOR (self-explanatory) and other SWD comparator registers (which must return 0xFFFFFFFF)... and the systick.
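
On the emulator side, the checks described so far boil down to returning the expected value for a handful of addresses. This is a minimal sketch with hypothetical names (`emulated_swd_read`, `PC_SAMPLE_REG`); it only models the two reads whose addresses are given above, and which bootrom version corresponds to 0x180 versus 0x174 is left open because it is not stated. As far as I can tell 0xE000101C is the DWT program counter sample register, but treat that name as my assumption.

```python
PADS_BANK0_SWD = 0x4001C080
PC_SAMPLE_REG = 0xE000101C          # "recently executed" instruction address
CORE1_WFE_ADDRS = (0x180, 0x174)    # depends on the bootrom version


def emulated_swd_read(core: int, addr: int, wfe_addr: int = 0x180):
    """Return the value the loader expects, or None for unmodelled addresses."""
    if addr == PADS_BANK0_SWD:
        return 0x80                 # the pad must still be disabled
    if addr == PC_SAMPLE_REG and core == 1:
        if wfe_addr not in CORE1_WFE_ADDRS:
            raise ValueError("unexpected WFE address")
        return wfe_addr             # core 1 is parked at WFE in the bootrom
    return None                     # VTOR, systick, etc. need real state


print(hex(emulated_swd_read(1, PC_SAMPLE_REG, 0x174)))  # 0x174
```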

Normally, emulating the systick would be easy, but because it's being read and written by the SWD protocol, we need to keep in mind not only which SWD bit write causes the memory operation (it's the first read done by the 16 bit "turnaround" right after SWCLK is forced high) but also *when* the protocol is permitted to access memory. This is because it must go through the processor core, which can only read/write one address at a time.

After much pain and experimentation I found that both reads and writes dispatch exactly 4 cycles after the instruction that forces SWCLK high.. unless the processor is accessing memory. This means that if an instruction is aligned to 4 bytes, or accesses memory a bunch of times (like POP or LDM), or is 4 bytes wide (like MSR/MRS/BL), the SWD operation will be delayed.

As a further example, if an instruction is NOT aligned to 4 bytes and accesses an AHB-lite address, which normally takes 2 cycles, the first cycle will be used to perform the access, and the processor stalls on the second cycle, where the SWD operation can take place. However, if the instruction IS aligned to 4 bytes the first cycle is spent fetching the word it sits on, and the next cycle is used to perform the access, so there is no room and it has to wait for the next instruction. The actual logic is more involved (like with how it interacts with 4 byte instructions and instructions that reference SIO memory) but the bottom line is that getting something to work accurately is not impossible.
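
The timing rule can be reduced to "4 cycles after the trigger, then wait for the first free bus cycle". This is a deliberately simplified sketch: it assumes the emulator already knows, per cycle, whether the core is using the bus (fetch or data access), and it ignores the special cases mentioned above for 4-byte instructions and SIO accesses. `swd_dispatch_cycle` and `bus_busy` are names invented for this example.

```python
from typing import Sequence

SWD_DISPATCH_DELAY = 4  # cycles after the instruction that forces SWCLK high


def swd_dispatch_cycle(trigger_cycle: int, bus_busy: Sequence[bool]) -> int:
    """Return the first cycle >= trigger + 4 on which the core's bus is free.

    bus_busy[c] is True when the core performs a fetch or data access on
    cycle c; cycles past the end of the sequence are treated as free.
    """
    cycle = trigger_cycle + SWD_DISPATCH_DELAY
    while cycle < len(bus_busy) and bus_busy[cycle]:
        cycle += 1
    return cycle


# Bus free for 4 cycles, busy on cycles 4 and 5, free again on cycle 6.
print(swd_dispatch_cycle(0, [False] * 4 + [True, True, False]))  # 6
```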

..once all of that is implemented, and your emulator properly counts cycles, it's almost smooth sailing from then on. It will keep reading and writing the peripherals mentioned above until a certain point.

Now for the fun part: I imagine the author of the firmware realized that patching the board ID in code was too trivial. To mitigate this, of course, they just.. read it with SWD operations. This requires them to start using PLL_REF for some reason (I noticed while replicating this on my Pico that if I didn't use CLK_REF it would freeze.. not sure why), but after that they do the standard reads/writes to 0x18000060 with the message 0x4B.

An emulator can simply observe these reads and writes and return the flash ID, but someone running this on an unintended system will run into trouble.. you could patch the SWD read/write routines and restore the systick each time, though.
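
For an emulator, answering the 0x4B exchange only needs a tiny flash model. This is a sketch that assumes the usual "read unique ID" framing (one command byte, 4 dummy bytes, then 8 ID bytes), which is how I understand the Pico SDK does it; whether the loader clocks exactly this many bytes is an assumption. `ruid_exchange` is a hypothetical helper, and the ID is the placeholder one used for the dump attached later in the thread.

```python
RUID_CMD = 0x4B         # the message written to 0x18000060
RUID_DUMMY_BYTES = 4    # assumed framing, see lead-in
RUID_DATA_BYTES = 8


def ruid_exchange(tx: bytes, unique_id: bytes) -> bytes:
    """Bytes a flash model shifts out while `tx` is shifted in."""
    if len(unique_id) != RUID_DATA_BYTES:
        raise ValueError("unique ID must be 8 bytes")
    rx = bytearray(len(tx))
    if not tx or tx[0] != RUID_CMD:
        return bytes(rx)  # other commands are not modelled
    start = 1 + RUID_DUMMY_BYTES
    for i in range(start, min(len(tx), start + RUID_DATA_BYTES)):
        rx[i] = unique_id[i - start]
    return bytes(rx)


board_id = bytes([0xE6, 0x60, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00])
rx = ruid_exchange(bytes([RUID_CMD]) + bytes(12), board_id)
print(rx[5:].hex())  # e660000000000000
```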

During this, I assume the systick is somewhat unreliable, so they replace a conspicuous global variable pointer with the address of the XOSC COUNT register (which just happens to now run at the same frequency as the sys clk because it uses the REF PLL). That pointer is accessed in a sequence like:

Code:
LDR Rx, [0x.......] ; gets changed from a random byte to XOSC COUNT address
...
STRB Rn, [Rx] ; in the loop, it stores a part of the VTOR to that address; this is the first access each loop, which explains why they did this.
...
LDRB Ry, [Rx] ; first read. if a normal byte address, will be the same value.
              ; if the XOSC count register, will be decremented a bit
...
LDRB Ry, [Rx] ; ditto (Ry is just a placeholder destination register)

meaning that in between reads you have to emulate cycle differences. I found that from the write instruction an offset like +3 worked, although at that point I had already counted a 4 cycle delay from writing to that peripheral. Otherwise, it's the same clk counting idea as the systick, but a lot smaller and easier to see if you're doing it right via doing the same on actual hardware.
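
A down counter like this is easy to model once cycles are counted. This is a toy sketch, assuming the counter ticks once per emulated cycle (per the note above about the clocks matching), stops at zero, and is 8 bits wide as accessed by the byte load/store. The `latency` parameter is a tunable stand-in for the offset that has to be found by comparing against real hardware; its exact meaning and sign here are my assumption.

```python
class XoscCount:
    """Toy model of a down counter that stops at zero."""

    def __init__(self, latency: int = 0):
        self.latency = latency  # assumed cycles before the count starts
        self.value = 0
        self.written_at = 0

    def write(self, value: int, cycle: int) -> None:
        """Load the counter at the given emulator cycle."""
        self.value = value & 0xFF
        self.written_at = cycle

    def read(self, cycle: int) -> int:
        """Value seen by a read at the given emulator cycle."""
        elapsed = max(0, cycle - self.written_at - self.latency)
        return max(0, self.value - elapsed)


counter = XoscCount(latency=3)
counter.write(0x40, cycle=100)
print(counter.read(110))  # 57
```

Two reads a few instructions apart then return different values, which is exactly the difference the loop is sensitive to.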

After verifying the board ID, they reset the sys clk back to what it was and stop using the XOSC COUNT register in the loop.

Doing more of the same typical reads/writes, they eventually read the systick again, which initially threw me off because my value was wrong. Doing exactly the same thing on my Pico revealed that in total the flash accesses add a delay of 52 cycles, which also worked here, thankfully.
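
The 52-cycle correction slots into a systick model as just another advance. This is a minimal sketch, assuming the standard 24-bit down counter that reloads after reaching zero and a current value that never exceeds the reload value; `SysTick` and `FLASH_ID_EXTRA_CYCLES` are names made up for the example.

```python
FLASH_ID_EXTRA_CYCLES = 52  # measured total delay of the flash accesses


class SysTick:
    """24-bit down counter that reloads from `reload` after reaching zero."""

    def __init__(self, reload: int = 0xFFFFFF, current: int = 0xFFFFFF):
        self.reload = reload & 0xFFFFFF
        self.current = current & 0xFFFFFF

    def advance(self, cycles: int) -> None:
        """Move the counter forward by `cycles` clock ticks."""
        period = self.reload + 1
        self.current = (self.current - cycles) % period


tick = SysTick(current=100)
tick.advance(30)                     # normal instruction cycles
tick.advance(FLASH_ID_EXTRA_CYCLES)  # extra delay from the flash ID reads
print(tick.current)  # 18
```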

Finally, using the SWD interface they will write and execute PIO(!!) programs. The following only omits the typical reads and writes as well as specific writes to SRAM.

-GPIO16 funcsel <= NULL
-writes to PIO0 instr mem starting at 8:

0x6030
0xC010
0x20A0
0x2020
0x6081
0x004A
0xC050
0x6060
0x6020
0x2041
0x00D2
0x203E
0x20BE
0x203E
0x20BE
0x4001
0x0055
0x6020
0x2040
0x00DB
0x203F
0x20BF
0x4001
0x005C

-then does a bunch of normal reads/writes with varying patterns

-replaces globalvar pointers in executeRWC with SIO INTERP0_BASE0, INTERP0_BASE1, INTERP0_BASE2
-they don't read any result values from this, so they act like normal memory locations
-reads the value of the registers (with SWD) to make sure they're being manipulated

it then starts writing PIO programs into memory, and executing PIO instructions directly:

RESETS WDSEL <= 0xC00 (bits 10, 11); PIO1, PIO0 (clear bits)
PSM WDSEL <= 0x4000 (bit 14; SIO reset)

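it then writes a PIO program to PIO1 (the addresses below start at 0x50300048, which is INSTR_MEM0, so this one appears to start at instruction 0 rather than 8 and fills all 32 slots):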

swdwrite 0x50300048 <= 0xE021
swdwrite 0x5030004C <= 0xC023
swdwrite 0x50300050 <= 0xA047
swdwrite 0x50300054 <= 0xC3
swdwrite 0x50300058 <= 0x2020
swdwrite 0x5030005C <= 0x20A0
swdwrite 0x50300060 <= 0x84
swdwrite 0x50300064 <= 0x42
swdwrite 0x50300068 <= 0xC040
swdwrite 0x5030006C <= 0x6020
swdwrite 0x50300070 <= 0x6040
swdwrite 0x50300074 <= 0xC023
swdwrite 0x50300078 <= 0xC020
swdwrite 0x5030007C <= 0x4D
swdwrite 0x50300080 <= 0x108E
swdwrite 0x50300084 <= 0xC021
swdwrite 0x50300088 <= 0xA027
swdwrite 0x5030008C <= 0xD1
swdwrite 0x50300090 <= 0x2020
swdwrite 0x50300094 <= 0x20A0
swdwrite 0x50300098 <= 0x52
swdwrite 0x5030009C <= 0xC043
swdwrite 0x503000A0 <= 0xA027
swdwrite 0x503000A4 <= 0xD7
swdwrite 0x503000A8 <= 0x203F
swdwrite 0x503000AC <= 0x20BF
swdwrite 0x503000B0 <= 0x4001
swdwrite 0x503000B4 <= 0x58
swdwrite 0x503000B8 <= 0xA026
swdwrite 0x503000BC <= 0x8020
swdwrite 0x503000C0 <= 0xB6
swdwrite 0x503000C4 <= 0xC041

PIO1 SM0_EXECCTRL <= 0x1C01FB00
PIO1 SM0_SHIFTCTRL <= 0x80010000
PIO1 SM0_PINCTRL <= 0xE0000

then executes the following instructions on PIO1 using the SM0_INSTR register:
swdwrite 0x503000D8 <= 0x16
swdwrite 0x503000D8 <= 0xE042
swdwrite 0x503000D8 <= 0x4044
swdwrite 0x503000D8 <= 0xE04F
swdwrite 0x503000D8 <= 0x4044
swdwrite 0x503000D8 <= 0xA0E6
swdwrite 0x503000D8 <= 0xA0C3
swdwrite 0x503000D8 <= 0xE041
swdwrite 0x503000D8 <= 0x4044
swdwrite 0x503000D8 <= 0xE043
swdwrite 0x503000D8 <= 0x4044
swdwrite 0x503000D8 <= 0xE045
swdwrite 0x503000D8 <= 0x4044
swdwrite 0x503000D8 <= 0xE041
swdwrite 0x503000D8 <= 0x4044
swdwrite 0x503000D8 <= 0xA046

it then enables PIO1 SM0 (sets bit 1 of reg at 0x000).


PIO1 SM2_EXECCTRL <= 0x1D015780
PIO1 SM2_SHIFTCTRL <= 0x90000
PIO1 SM2_PINCTRL <= 0xD8000

it then executes the following instructions on PIO1 using the SM2_INSTR register:
swdwrite 0x50300108 <= 0xE042
swdwrite 0x50300108 <= 0x4044
swdwrite 0x50300108 <= 0xE041
swdwrite 0x50300108 <= 0x4044
swdwrite 0x50300108 <= 0xE040
swdwrite 0x50300108 <= 0x4044
swdwrite 0x50300108 <= 0xA0E6


PIO1 SM1_PINCTRL <= 0x40001E0
it then executes an instruction on PIO1 using the SM1_INSTR register:
swdwrite 0x503000F0 <= 0xE09F

PIO1 SM1_EXECCTRL <= 0xE480
PIO1 SM1_SHIFTCTRL <= 0x60000
PIO1 SM1_PINCTRL <= 0x20003C00

PIO1 SM3_EXECCTRL <= 0x1C008000
PIO1 SM3_SHIFTCTRL <= 0x90000
PIO1 SM3_PINCTRL <= 0xD8000

it then executes the following instructions on PIO1 using the SM3_INSTR register:
swdwrite 0x50300120 <= 0xE042
swdwrite 0x50300120 <= 0x4044
swdwrite 0x50300120 <= 0xE04F
swdwrite 0x50300120 <= 0x4044
swdwrite 0x50300120 <= 0xA0E6

Finally, along with some other innocuous SRAM writes, it will jump to the binary blob by overwriting a return address on the stack, ending the decryption algorithm.

I hope this information was helpful to someone, and if you want to know more please do not hesitate to DM me. I've spent a long time working on this and I would be very glad knowing it helped someone.
oh nice, you have reached the second decryption stage
 

flynnsmt4

Member
Newcomer
Joined
Feb 20, 2023
Messages
11
Trophies
0
XP
155
Country
United States
I have to say I'm impressed by your work :-)
Post automatically merged:

..and just my luck to realize a bit later that rehius posted a firmware that deliberately ignores the firmware ID check--in fact, it even cuts short the XIP flash ID sequence. As far as I remember this was the most recently posted one. Attached is the SRAM dump from right after it jumps into the binary blob as well as a txt file containing all of the SWD reads and writes. I just used the ID { 0xE6, 0x60, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 }, if you're wondering.
 

Attachments

  • sram_enters_20012E50.bin.pdf
    264 KB · Views: 42
  • swd info.txt
    15.7 KB · Views: 54
Last edited by flynnsmt4,

binkinator

Garfield’s Fitness Coach
Member
GBAtemp Patron
Joined
Mar 29, 2021
Messages
6,511
Trophies
2
XP
6,155
Country
United States
Are you saying the two could technically be pasted together to get an Atmosphere-enabled Pico firmware that doesn’t check for the ID?
 
  • Wow
  • Like
Reactions: impeeza and peteruk

flynnsmt4

Member
Newcomer
Joined
Feb 20, 2023
Messages
11
Trophies
0
XP
155
Country
United States
AFAIK the only difference is that (other than that one having better eMMC support?) it doesn't check the flash ID; none of them can load Atmosphere.
@flynnsmt4 an emulator? that’s cheating! :(

jokes aside, great job!
I actually considered just patching it and loading it onto my Pico, but I'm glad I didn't, given that it executes individual PIO instructions, which would obviously be lost if I had just dumped the SRAM after the fact.
 

flynnsmt4

Member
Newcomer
Joined
Feb 20, 2023
Messages
11
Trophies
0
XP
155
Country
United States
Here is another firmware as well as its SRAM dump at the second stage loader, plus all of the SWD reads and writes. The person who gave it to me generously allowed me to publish it here. The board ID is E6 61 1C B7 1F 32 68 29, and the bin with the board ID name is the initial flash (loaded at 0x10000000) while the other bin is the SRAM contents at 0x20000000. Its name contains the address of the first instruction that's executed.

..I'm also being told this firmware is capable of loading Atmosphere, which means I probably should have been reading this thread more closely.
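
Since the dump's name carries the entry address, finding where execution starts inside the file is simple arithmetic. This is a small sketch, assuming the `sram_enters_XXXXXXXX` naming used for the attachments and that the dump begins at 0x20000000; `entry_from_name` is a hypothetical helper.

```python
import re

SRAM_BASE = 0x20000000  # the dump holds SRAM contents starting here


def entry_from_name(name: str):
    """Return (entry address, offset into the SRAM dump) parsed from the name."""
    match = re.search(r"sram_enters_([0-9A-Fa-f]{8})", name)
    if match is None:
        raise ValueError("no entry address in file name")
    entry = int(match.group(1), 16)
    # Clear bit 0 in case the address was recorded with the Thumb bit set.
    return entry, (entry & ~1) - SRAM_BASE


entry, offset = entry_from_name("sram_enters_20011AF0.bin.pdf")
print(hex(entry), hex(offset))  # 0x20011af0 0x11af0
```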
 

Attachments

  • 20011AF0_info.txt
    11.4 KB · Views: 46
  • E6 61 1C B7 1F 32 68 29.bin.pdf
    91 KB · Views: 42
  • sram_enters_20011AF0.bin.pdf
    264 KB · Views: 45

Dee87

Well-Known Member
Member
Joined
Mar 19, 2023
Messages
1,139
Trophies
1
XP
1,575
Country
Germany
Awesome work u guys are doing here.

Wish I could be a magician like u guys :switch:
 
  • Love
Reactions: impeeza

flynnsmt4

Member
Newcomer
Joined
Feb 20, 2023
Messages
11
Trophies
0
XP
155
Country
United States
re @binkinator: if someone wanted to try to run the stage 2 loader, they could technically reverse the stage 1 loader firmware and replace the decryption + SWD writes with the decrypted bin (while still doing the same writes), and if it doesn't check the board ID then it will just work. I'd recommend that someone actually reverse engineer it (the stage 2 loader) first though. The stage 1 loader is small enough that doing this isn't really a gargantuan task, you just need to know what you're doing.
 
Last edited by flynnsmt4,

binkinator

Garfield’s Fitness Coach
Member
GBAtemp Patron
Joined
Mar 29, 2021
Messages
6,511
Trophies
2
XP
6,155
Country
United States
Makes sense. Really appreciate the solid roadmap that’s been laid out here.
 
  • Love
Reactions: impeeza
