Picofly - a HWFLY switch modchip

Dee87

Well-Known Member
Member
Joined
Mar 19, 2023
Messages
1,139
Trophies
1
XP
1,591
Country
Germany
sometimes the failures are just as relevant as the successes.

If you do put one together I would steal a bit of @rehius’s solder work…a straight bar across the bottom and top of the caps.


Hang the mosfet off to the side and on the other end put some nice fat pads to solder to the chip and you’ll have a winner. :-)
I will see what I can do later :-)
When I'm done I will upload it so everyone can have a look at it and maybe we can make some changes together :-)
 

impeeza

¡Kabito!
Member
Joined
Apr 5, 2011
Messages
7,317
Trophies
4
Age
46
Location
At my chair.
XP
23,483
Country
Colombia

Dee87

We could get them made at JLCPCB or so, with even faster shipping and probably cheaper than AliExpress?

Just checked a few sites and I don't think we can get them cheaper XD
Just checked with the FPC that came from your adventure XD

They were like $112 for 10 pieces,
so we might still have to order them from Ali :-(

I'm still gonna try to create a new flex cable; still having issues finding the footprint package for the PQFN 2mm x 2mm,
but we will see B-)
 

binkinator

Garfield’s Fitness Coach
Member
GBAtemp Patron
Joined
Mar 29, 2021
Messages
6,511
Trophies
2
XP
6,191
Country
United States
Yeah, they probably did a batch run of the simple cable. The flex cables are super cheap, not even the same quality as an FPC you'd get from JLCPCB. Going to be hard to beat.
 

Dee87

Yeah, probably. Never ordered FPCs to be honest, only normal PCBs, and they were always pretty cheap,
but at least we know better now :-)

So for now I'm gonna stick with the wire till I order some flex cable.
Does anyone know some place in the EU where you can order them from?
 

impeeza

Yeah, they probably did a batch run of the simple cable. The flex cables are super cheap, not even the same quality as an FPC you'd get from JLCPCB. Going to be hard to beat.
You will never beat the prices they get buying wholesale; a single piece could cost as much as 10 units do when you order 100 at the same time.
 

binkinator

I found out when I was putting together my double battery mod. I only wanted one.

Let’s just say the price was…shocking :-O
 

impeeza

Now that you people triggered my memory: back in the day I used to make prototype PCBs from copper-clad boards. I printed the PCB traces on a laser printer, transferred the toner to the copper board using a clothes iron, and then used ferric chloride to remove the unprotected copper. That was cheap as hell but took time; I have some microcontroller developer boards and programmers made that way still working today.
 

Dee87

Yeah, been there and done that, good old days XD
These days I use my small CNC milling machine, or if I need more and I'm too lazy I just order them from JLCPCB.
I've also 3D printed some PCBs for fun, but it's not even worth it hahahahaha
 

qgywibczozfvvl

Well-Known Member
Newcomer
Joined
Mar 6, 2023
Messages
88
Trophies
0
XP
85
Country
Germany
Now that you people triggered my memory: back in the day I used to make prototype PCBs from copper-clad boards. I printed the PCB traces on a laser printer, transferred the toner to the copper board using a clothes iron, and then used ferric chloride to remove the unprotected copper. That was cheap as hell but took time; I have some microcontroller developer boards and programmers made that way still working today.
Old man, just look at me, I am a lot like you
 

rcpd

Well-Known Member
Member
Joined
Jan 31, 2023
Messages
617
Trophies
0
Age
55
XP
1,389
Country
United States
I found out when I was putting together my double battery mod. I only wanted one.

Let’s just say the price was…shocking :-O
I always order in bulk from JLC. If I need one, I'm getting 50/100 because it's going to be cheaper and about 10% of the batch just doesn't work. I make a lot of Pi HATs that are extremely one-off. I've got boxes of spares in the basement.

Edit: can anyone comment on PCBWay? They're generally more expensive unless you get into JLC's non-stock options; then JLC gets out of control. Does PCBWay populate? JLC will not, and when you make a bunch of Pi HATs, soldering header pins gets old fast.
 

Dee87

IMG_20230319_210212.jpg
IMG_20230319_210251.jpg

Just playing around with some donor boards. Which solution should I use?

I always order in bulk from JLC. If I need one, I'm getting 50/100 because it's going to be cheaper and about 10% of the batch just doesn't work. I make a lot of Pi HATs that are extremely one-off. I've got boxes of spares in the basement.

Edit: can anyone comment on PCBWay? They're generally more expensive unless you get into JLC's non-stock options; then JLC gets out of control. Does PCBWay populate? JLC will not, and when you make a bunch of Pi HATs, soldering header pins gets old fast.
Yeah, that's true, bulk is the way to go....
Still gonna try to create a PCB, then I'll check again how much bulk costs and maybe just buy in bulk and offer them for Picofly; we will see

Never used PCBWay, only JLCPCB, and never had issues with them
 

Magnus Hydra

It’s rare for me to be here.
Member
Joined
Dec 12, 2011
Messages
172
Trophies
1
XP
622
Country
United States
Like the flex ribbon idea.
 

Warbeast

Active Member
Newcomer
Joined
Dec 30, 2015
Messages
40
Trophies
0
Age
44
XP
294
Country
I picked up a V2 with a bad M92T from eBay to repair and mess with the Pico. Hekate booted on the first try, and OFW works fine both with the volume buttons and with the reboot to OFW option. I'm using the 2.5 firmware.

Is it normal for Hekate to show the eMMC in slow mode, or did I just buy one with a bad eMMC? It boots to HOS and Hekate fine, but under eMMC info I get the slow mode error.
 

TheSynthax

Well-Known Member
Member
Joined
Apr 29, 2018
Messages
223
Trophies
0
XP
540
Country
United States
Sounds like a failing eMMC; I'd make a dump before it goes.
 

rcpd

I picked up a V2 with a bad M92T from eBay to repair and mess with the Pico. Hekate booted on the first try, and OFW works fine both with the volume buttons and with the reboot to OFW option. I'm using the 2.5 firmware.

Is it normal for Hekate to show the eMMC in slow mode, or did I just buy one with a bad eMMC? It boots to HOS and Hekate fine, but under eMMC info I get the slow mode error.
In Hekate, go to Console Info and check the eMMC. If it fails to init, you should be making a backup as soon as possible. That eMMC is likely failing.
 

flynnsmt4

Member
Newcomer
Joined
Feb 20, 2023
Messages
11
Trophies
0
XP
155
Country
United States
I'm pleased to announce that, after a significant amount of time working on a cycle-accurate (this is very important) emulator, I've finally been able to get past the decryption phase and dump the segments of ARM code that are written by the end of the decryption. I have to say, this is the most fun CTF I've ever done, although at some point it ran out of steam and stopped surprising me as much.

My dump isn't perfect: the code it jumps to is itself a tiny bit obfuscated (it copies code to other locations and jumps there to fool IDA's auto-analysis), but as far as I'm concerned the hard part is over. I even have the PIO code(!!!) it writes and executes on the PIO1 state machines.

This is largely a follow-up of my previous post, and I don't want to duplicate information, so if you're confused I'd recommend reading that one first.

Also, big thanks to the people who sent me their firmwares; without them, none of this would have been possible. If you sent me a firmware and would like others to be able to look at what gets dumped, please tell me so I can do that. I can also release my emulator if it helps someone. It's just 2000 lines of hastily written Zig code, and you will have to manually find some patch addresses to make it work properly.

I'll split this up into sections to avoid spam.

First, and least important: the mysterious SWD message it sends at the very start just wakes the SWD interface from its dormant state, so it's not very interesting.

Then there is the decryption. After initializing some data structures (I called them "wordbank0", "wordbank1", and "constant_random_data_waste_of_time"), it sets VTOR to an initial value (e.g. 0xEE2F8D10, which gets truncated to 0xEE2F8D00 because the low 8 bits of the RP2040's VTOR are not writable) and then processes the binary blob at the base of SRAM 16 bytes at a time (we'll call each chunk a block). It also takes the 8-byte board ID and copies it into a 256-byte structure which I call the flash XOR buffer.

In a block, for each byte, it first derives a key from the value stored in VTOR, the flash XOR buffer, and a rolling index into that buffer. This value, combined with the previous value of the process stack pointer (PSP/SP_Process/whatever ARM calls it), is then written back to the PSP.

This means that part of the decryption relies on the *UNINITIALIZED* value of the PSP, which is 0xFFFFFFFC, in case you were having trouble with your emulation.

The key is manipulated some more after that. Interestingly, it then sets the flash XOR buffer entry at the selected index to the current encrypted byte. Finally, it XORs the encrypted byte with the key and writes the result.
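
A toy Python model of the per-byte loop may help when building an emulator. The mixing in `_next_key` is entirely invented, and tiling the 8-byte board ID across the 256-byte buffer is an assumption (the post only says it is copied in). What the sketch does take from the description is the data flow (VTOR, buffer entry and rolling index feed the key, the key passes through the PSP, the encrypted byte is fed back into the buffer, output is encrypted byte XOR key) and the register masking: the RP2040's VTOR ignores its low 8 bits, and the PSP is always word aligned.

```python
VTOR_MASK = 0xFFFFFF00  # RP2040 VTOR: TBLOFF is bits [31:8], low byte reads as 0
PSP_MASK = 0xFFFFFFFC   # PSP is word aligned: bits [1:0] always read as 0


def write_vtor(value: int) -> int:
    return value & VTOR_MASK


def write_psp(value: int) -> int:
    return value & PSP_MASK


class ToyCipher:
    """Hypothetical model of the per-byte data flow. The mixing is made up;
    it is NOT the firmware's real key schedule."""

    def __init__(self, board_id: bytes, vtor: int, psp: int = 0xFFFFFFFC):
        if len(board_id) != 8:
            raise ValueError("board ID must be 8 bytes")
        # Assumption: the board ID is repeated to fill the 256-byte XOR buffer.
        self.xor_buf = bytearray(board_id * 32)
        self.vtor = write_vtor(vtor)
        self.psp = write_psp(psp)  # uninitialised PSP observed as 0xFFFFFFFC
        self.idx = 0

    def _next_key(self) -> int:
        # Invented mix of VTOR, the current buffer entry and the rolling index.
        mix = (self.vtor >> 8) ^ self.xor_buf[self.idx] ^ self.idx
        # Combine with the previous PSP and write it back; the hardware
        # silently drops bits [1:0], which an emulator must reproduce.
        self.psp = write_psp((self.psp + mix) & 0xFFFFFFFF)
        return (mix ^ (self.psp >> 8)) & 0xFF

    def _step(self, in_byte: int, decrypting: bool) -> int:
        key = self._next_key()
        out = in_byte ^ key
        # Feedback: the buffer slot always receives the *encrypted* byte.
        self.xor_buf[self.idx] = in_byte if decrypting else out
        self.idx = (self.idx + 1) & 0xFF
        return out

    def decrypt(self, data: bytes) -> bytes:
        return bytes(self._step(b, True) for b in data)

    def encrypt(self, data: bytes) -> bytes:
        return bytes(self._step(b, False) for b in data)
```

Because the buffer is always fed with the encrypted byte, both directions see the same key stream, so `ToyCipher(bid, v).decrypt(ToyCipher(bid, v).encrypt(data)) == data` for any 8-byte `bid`. `write_vtor(0xEE2F8D10)` gives `0xEE2F8D00`, matching the truncation described above.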

At the end of a block, after all 16 bytes have been written, it takes the PSP shifted right by 8 bits and XORs it with byte 15 (that is, the last decrypted byte in the block). Depending on which of bits 0 through n of the result are set (n seems to vary between firmwares), it calls a function we'll call readWriteOrCall in a loop of up to n times (the same n!). This will be looked at later.

Finally, it writes a new PSP. All of this can be seen in the publicly available firmwares (in this thread), so I won't bother going into specifics; it's not that complicated.

We then have readWriteOrCall, which, based on its input (the value of bit n of (PSP >> 8) ^ last_encrypted_byte ^ key), manipulates the previously mentioned word banks (wordbank0/wordbank1) with some division, multiplication, shifts, etc. (it sounds complicated, but it's simple enough to just hit F5 in IDA and replicate the result) until it eventually, maybe, decides to call the most important function, which I just called executeRWC.
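
The block-end step can be sketched as below. The two descriptions agree: the end-of-block step XORs with the last *decrypted* byte, while readWriteOrCall's input is written in terms of the last encrypted byte and the key, and decrypted = encrypted ^ key. `n` and the `read_write_or_call` callback are placeholders, and calling it once per set bit is one reading of "a loop of up to n times", not a confirmed detail.

```python
from typing import Callable, List


def block_end_bits(psp: int, last_plain: int, n: int) -> List[int]:
    """Bit indices 0..n-1 that are set in (PSP >> 8) ^ last decrypted byte."""
    selector = ((psp >> 8) ^ last_plain) & 0xFFFFFFFF
    return [i for i in range(n) if (selector >> i) & 1]


def run_block_end(psp: int, last_plain: int, n: int,
                  read_write_or_call: Callable[[int], None]) -> int:
    """Invoke the (hypothetical) handler once per selected bit; return call count."""
    bits = block_end_bits(psp, last_plain, n)
    for bit in bits:
        read_write_or_call(bit)
    return len(bits)
```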

executeRWC is very funny because in the middle of its control-flow graph there is an innocuous branch that loads a value into R0 and then jumps to it. For a while I thought this was where it jumps into the encrypted blob, but that is wrong. In fact, it never takes that branch. Go figure! Like I said, this was the most interesting CTF I've ever done.

executeRWC is also very important. It has two other (used) features: it can arbitrarily read or write memory by sending sequences to a core's SWD port. The action taken, the addresses and data used, and which processor to do it on (this is also important) are loosely derived from the value given to it, which is derived from how the word banks are manipulated, which in turn is derived from the bits of the decrypted byte and the PSP. Certainly a mouthful.

As expected, the writes are mostly used for anti-debugging. After initializing the SysTick (!!!), it spams writes of 0x80 to 0x4001C080, the pad control register for the SWD data pin. 0x80 sets the pad's output-disable bit and clears its input-enable bit, which is why it's impossible to debug the RP2040 while it's decrypting the blob. It also periodically reads this register back to make sure it's still 0x80.
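
Decoding the value shows why 0x80 locks out a debugger. The sketch below splits a pad-control value into fields using the bit layout of the RP2040 datasheet's PADS_BANK0 registers (bit 7 OD output disable, bit 6 IE input enable, bits 5:4 drive strength, then pull-up, pull-down, Schmitt trigger and slew rate); `decode_pad` itself is just an illustrative helper.

```python
PADS_BANK0_BASE = 0x4001C000
SWD_PAD_CTRL = PADS_BANK0_BASE + 0x80  # pad control register for the SWD data pin


def decode_pad(value: int) -> dict:
    """Split an RP2040 PADS_BANK0 pad-control value into named fields."""
    return {
        "OD": bool(value & 0x80),        # output disable
        "IE": bool(value & 0x40),        # input enable
        "DRIVE": (value >> 4) & 0x3,     # drive strength code
        "PUE": bool(value & 0x08),       # pull-up enable
        "PDE": bool(value & 0x04),       # pull-down enable
        "SCHMITT": bool(value & 0x02),   # Schmitt trigger enable
        "SLEWFAST": bool(value & 0x01),  # fast slew rate
    }
```

With 0x80 only OD is set and IE is clear, so the pad neither drives nor receives, and the debugger's SWD traffic never reaches the chip.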

There are other things it does with the SWD writes, but that's also for later.

SWD reads fetch a value and then use it to manipulate VTOR. Yes, the very same VTOR that is used to derive the decryption key and modify the PSP... meaning that each read is essentially a check that a memory address contains an expected value: if it doesn't, every key derived afterwards is wrong.
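
There is no compare-and-branch to patch out here: the value read is simply mixed into state that later feeds the key, so a wrong value silently corrupts everything decrypted afterwards. A toy illustration with an invented mixing step (`fold_read` is hypothetical; the real arithmetic lives in the firmware):

```python
VTOR_MASK = 0xFFFFFF00  # the low byte of VTOR reads as zero on the RP2040


def fold_read(vtor: int, value: int) -> int:
    """Hypothetical: mix an SWD read result into VTOR (not the real formula)."""
    return (vtor ^ (value << 8)) & VTOR_MASK
```

Reading the expected 0x80 from the pad register and reading the pad's reset value would leave VTOR in different states, and the divergence then propagates through every later key.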

SWD writes are always done by processor 0, but SWD reads can alternate between 0 and 1 (where 0 is the core executing this code and 1 is the secondary core). However, reads are only done by processor 1 when reading 0xE000101C. What is that peripheral, one might ask? It sits in the Private Peripheral Bus region, but the RP2040 datasheet does not document it. It turns out, of course, that the ARMv6-M Architecture Reference Manual documents it as DWT_PCSR, the Program Counter Sample Register, which holds the address of a "recently executed" instruction (the manual deliberately doesn't define this).

Because the read is done by core 1, they are essentially checking whether core 1 is parked at a WFE instruction in the RP2040's bootrom. In other words, they're checking whether any code is running on the other core; if you were trying to run e.g. debugging routines on that core and the decryption was failing, this is why. They do this read fairly often, and the value they expect is either 0x180 or 0x174, depending on the bootrom version.
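
An emulator therefore has to answer this read as if core 1 were idle. The sketch below is a hypothetical read hook, not part of any real emulator API: it returns a parked PC for core-1 reads of DWT_PCSR (0xE000101C) and plain memory contents otherwise. Which of 0x174 and 0x180 belongs to which bootrom revision isn't stated, so the checker accepts either.

```python
DWT_PCSR = 0xE000101C        # DWT Program Counter Sample Register (ARMv6-M)
PARKED_PCS = {0x174, 0x180}  # values the firmware expects, per the post


class SwdReadResponder:
    """Hypothetical emulator hook for SWD reads issued by the firmware."""

    def __init__(self, memory: dict, core1_parked_pc: int = 0x180):
        self.memory = memory                    # address -> 32-bit word
        self.core1_parked_pc = core1_parked_pc

    def read(self, core: int, addr: int) -> int:
        if core == 1 and addr == DWT_PCSR:
            # Core 1 must appear to be sitting in the bootrom's WFE loop;
            # any user code running there would change the sampled PC.
            return self.core1_parked_pc
        return self.memory.get(addr, 0)


def core1_looks_idle(pcsr_value: int) -> bool:
    return pcsr_value in PARKED_PCS
```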

Beyond this, they mostly read VTOR (self-explanatory), other SWD comparator registers (which must return 0xFFFFFFFF)... and the SysTick.

Normally, emulating the SysTick would be easy, but because it's being read and written via the SWD protocol, we need to keep in mind not only which SWD bit write triggers the memory operation (it's the first read done in the 16-bit "turnaround" right after SWCLK is forced high) but also *when* the protocol is permitted to access memory. This matters because the access has to go through the processor core's bus, which can only read or write one address at a time.

After much pain and experimentation, I found that both reads and writes are dispatched exactly 4 cycles after the instruction that forces SWCLK high... unless the processor is accessing memory. This means that if an instruction is aligned to 4 bytes, accesses memory several times (like POP or LDM), or is 4 bytes wide (like MSR/MRS/BL), the SWD operation will be delayed.

For example, if an instruction is NOT aligned to 4 bytes and accesses an AHB-Lite address, which normally takes 2 cycles, the first cycle is used to perform the access and the processor stalls on the second, where the SWD operation can take place. However, if the instruction IS aligned to 4 bytes, the first cycle is spent fetching the word it sits in and the next cycle performs the access, so there is no free cycle and the SWD operation has to wait for the next instruction. The actual logic is more involved (e.g. how it interacts with 4-byte instructions and instructions that reference SIO memory), but the bottom line is that getting it to work accurately is not impossible.
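
The two cases above can be captured in a deliberately simplified cycle model. It is a sketch under stated assumptions: only 16-bit instructions, an instruction at a 4-byte-aligned address spends its first cycle on a fetch, data accesses come next, and any remaining cycles are stalls in which the debug port may use the bus. 32-bit instructions and SIO accesses, which the post says complicate things, are not modelled, and `Insn`, `busy_cycles` and `swd_dispatch_cycle` are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class Insn:
    addr: int           # address of the (16-bit) instruction
    cycles: int         # total execution cycles
    data_accesses: int  # cycles in which it drives a load/store on the bus


def busy_cycles(trace: list, start: int = 0) -> set:
    """Cycles in which the core itself occupies the bus (toy model)."""
    busy = set()
    cycle = start
    for insn in trace:
        fetch = 1 if insn.addr % 4 == 0 else 0
        n_busy = min(insn.cycles, fetch + insn.data_accesses)
        busy.update(range(cycle, cycle + n_busy))
        cycle += insn.cycles
    return busy


def swd_dispatch_cycle(trigger: int, busy: set, delay: int = 4) -> int:
    """First cycle at or after trigger + delay on which the bus is free."""
    cycle = trigger + delay
    while cycle in busy:
        cycle += 1
    return cycle
```

With SWCLK forced high at cycle 7 and a 2-cycle load starting at cycle 10, an unaligned load leaves cycle 11 free, so the SWD access lands there; an aligned load occupies cycles 10 and 11, pushing the access to cycle 12, the next instruction.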

Once all of that is implemented and your emulator properly counts cycles, it's almost smooth sailing from then on. It will keep reading and writing the peripherals mentioned above until a certain point.

Now for the fun part: I imagine the author of the firmware realized that patching the board ID in code was too trivial. To mitigate this, of course, they just... read it with SWD operations. This requires them to start using PLL_REF for some reason (I noticed while replicating this on my Pico that if I didn't use CLK_REF it would freeze, not sure why), but after that they do the standard reads/writes to 0x18000060, the XIP SSI data register, with the message 0x4B, the flash "Read Unique ID" command.

Obviously an emulator can just watch these reads and writes and return the flash ID, but someone running this on an unintended system will run into trouble... though you could patch the SWD read/write routines and restore the SysTick each time.
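
For an emulator, answering the unique-ID read means responding to the byte stream written to the SSI data register. The sketch assumes the usual SPI NOR layout for command 0x4B (one command byte, four dummy bytes, then eight ID bytes), which is also what the Pico SDK's `flash_get_unique_id` uses. `FakeSsi` is a hypothetical stub that ignores chip select, FIFO depth and every other SSI register.

```python
from collections import deque

SSI_DR0 = 0x18000060   # XIP SSI data register (TX on write, RX on read)
RUID_CMD = 0x4B        # SPI flash "Read Unique ID"
RUID_DUMMY_BYTES = 4
RUID_DATA_BYTES = 8


def ruid_tx_frames() -> list:
    """Bytes clocked out for a unique-ID read: command, dummies, filler."""
    return [RUID_CMD] + [0] * (RUID_DUMMY_BYTES + RUID_DATA_BYTES)


class FakeSsi:
    """Minimal emulator stub: each write to DR0 queues one RX byte."""

    def __init__(self, unique_id: bytes):
        self.uid = bytes(unique_id)
        self.tx = []
        self.rx = deque()

    def write(self, addr: int, value: int) -> None:
        if addr != SSI_DR0:
            return
        pos = len(self.tx)
        self.tx.append(value & 0xFF)
        first_id = 1 + RUID_DUMMY_BYTES
        if self.tx[0] == RUID_CMD and first_id <= pos < first_id + RUID_DATA_BYTES:
            self.rx.append(self.uid[pos - first_id])  # flash drives an ID byte
        else:
            self.rx.append(0)                         # command/dummy phase

    def read(self, addr: int) -> int:
        if addr == SSI_DR0 and self.rx:
            return self.rx.popleft()
        return 0
```

After writing the 13 frames and reading 13 bytes back, the last 8 bytes are the unique ID that the firmware copies into its flash XOR buffer.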

During this phase, I assume because the SysTick is somewhat unreliable here, they replace a conspicuous global variable pointer with the address of the XOSC COUNT register (which now happens to run at the same frequency as the system clock because it uses the REF PLL). The pointer is used in a sequence like:

Code:
LDR  Rx, [0x.......] ; pointer global; patched from the address of a random byte to the XOSC COUNT address
...
STRB Rn, [Rx]        ; in the loop, stores part of the VTOR there; this is the first access
                     ; of each loop iteration, which explains why they did this
...
LDRB Ry, [Rx]        ; first read: for a normal byte address, returns the stored value;
                     ; for the XOSC COUNT register, returns it decremented a bit
...
LDRB Rz, [Rx]        ; ditto

This means that between reads you have to emulate the cycle difference. I found that an offset of about +3 from the write instruction worked, although at that point I had already counted a 4-cycle delay for writing to that peripheral. Otherwise, it's the same cycle-counting idea as the SysTick, but at a much smaller scale, and it's easier to check whether you're doing it right by doing the same thing on actual hardware.
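
A possible model of the counter for an emulator, as a sketch: XOSC COUNT (offset 0x1C in the RP2040's XOSC block) is an 8-bit down counter that stops at zero. The class assumes it ticks once per emulated CPU cycle, which only holds here because of the clock switch, and makes the write-to-start latency a parameter defaulting to the +3 that worked above; `XoscCount` is an illustrative name.

```python
XOSC_COUNT = 0x40024000 + 0x1C  # RP2040 XOSC COUNT register


class XoscCount:
    """8-bit down counter that stops at zero (toy model).

    Assumes one tick per emulated CPU cycle and a tunable latency
    between the write instruction and the first decrement."""

    def __init__(self, write_latency: int = 3):
        self.latency = write_latency
        self.start_value = 0
        self.written_at = 0

    def write(self, cycle: int, value: int) -> None:
        self.start_value = value & 0xFF
        self.written_at = cycle

    def read(self, cycle: int) -> int:
        elapsed = max(0, cycle - self.written_at - self.latency)
        return max(0, self.start_value - elapsed)
```

For example, after `write(100, 0x40)` a read at cycle 110 returns 0x39 (seven ticks after the 3-cycle latency), while the same byte at an ordinary RAM address would still read 0x40, which is exactly the difference the loop detects.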

After verifying the board ID, they switch the system clock back to what it was and stop using the XOSC COUNT register in the loop.

Continuing with the same typical reads/writes, they eventually read the SysTick again, which initially threw me off because my value was wrong. Doing exactly the same thing on my Pico revealed that in total the flash accesses add a delay of 52 cycles, which also worked here, thankfully.
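
SysTick is a 24-bit down counter, so comparing two samples of its current value register (SYST_CVR, 0xE000E018) needs modular arithmetic. A small helper under that assumption, with the 52-cycle flash penalty measured above as a named constant (the constant's name is made up):

```python
SYST_CVR = 0xE000E018     # SysTick current value register
SYSTICK_MAX = 0xFFFFFF    # SysTick is a 24-bit down counter
FLASH_ID_PENALTY = 52     # extra cycles measured on a real Pico, per the post


def systick_elapsed(earlier: int, later: int, reload: int = SYSTICK_MAX) -> int:
    """Cycles between two CVR samples, assuming less than one full period
    passed and the counter reloads from `reload` after reaching zero."""
    return (earlier - later) % (reload + 1)
```

`systick_elapsed(10, 0xFFFFF0)` is 26: ten cycles down to zero, one to reload, fifteen more down to 0xFFFFF0. An emulator's expected value after the flash sequence would be its own elapsed count plus `FLASH_ID_PENALTY`.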

Finally, using the SWD interface, they write and execute PIO(!!) programs. The following omits the typical reads and writes as well as specific writes to SRAM.

-GPIO16 funcsel <= NULL
-writes to PIO0 instr mem starting at 8:

0x6030
0xC010
0x20A0
0x2020
0x6081
0x004A
0xC050
0x6060
0x6020
0x2041
0x00D2
0x203E
0x20BE
0x203E
0x20BE
0x4001
0x0055
0x6020
0x2040
0x00DB
0x203F
0x20BF
0x4001
0x005C
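
These words are raw PIO machine code. A small disassembler that follows the PIO instruction encoding in the RP2040 datasheet makes them readable. It is a sketch: the output only approximates pioasm syntax, and the 5-bit delay/side-set field is printed raw as `[ds=N]` because how it splits depends on the state machine's SIDESET_COUNT, which isn't known for PIO0 here.

```python
JMP_CONDS = ["", "!x", "x--", "!y", "y--", "x!=y", "pin", "!osre"]
WAIT_SRCS = ["gpio", "pin", "irq", "?"]
IN_SRCS = ["pins", "x", "y", "null", "?", "?", "isr", "osr"]
OUT_DSTS = ["pins", "x", "y", "null", "pindirs", "pc", "isr", "exec"]
MOV_DSTS = ["pins", "x", "y", "?", "exec", "pc", "isr", "osr"]
MOV_OPS = ["", "!", "::", "?"]
MOV_SRCS = ["pins", "x", "y", "null", "?", "status", "isr", "osr"]
SET_DSTS = ["pins", "x", "y", "?", "pindirs", "?", "?", "?"]


def irq_index(field: int) -> str:
    # Bit 4 of an IRQ index means "relative to the state machine number".
    return str(field & 0x7) + (" rel" if field & 0x10 else "")


def disasm(word: int) -> str:
    """Disassemble one 16-bit RP2040 PIO instruction (approximate pioasm)."""
    op = (word >> 13) & 0x7
    ds = (word >> 8) & 0x1F   # delay/side-set field, depends on SIDESET_COUNT
    a = (word >> 5) & 0x7     # condition / source / destination field
    b = word & 0x1F           # address / index / bit count / data
    count = b or 32           # IN/OUT encode a bit count of 32 as 0
    if op == 0:
        cond = JMP_CONDS[a]
        text = f"jmp {cond + ', ' if cond else ''}{b}"
    elif op == 1:
        src = WAIT_SRCS[(word >> 5) & 0x3]
        index = irq_index(b) if src == "irq" else str(b)
        text = f"wait {(word >> 7) & 1} {src} {index}"
    elif op == 2:
        text = f"in {IN_SRCS[a]}, {count}"
    elif op == 3:
        text = f"out {OUT_DSTS[a]}, {count}"
    elif op == 4:
        is_pull = (word >> 7) & 1
        cond = ("ifempty " if is_pull else "iffull ") if word & 0x40 else ""
        block = "block" if word & 0x20 else "noblock"
        text = f"{'pull' if is_pull else 'push'} {cond}{block}"
    elif op == 5:
        text = f"mov {MOV_DSTS[a]}, {MOV_OPS[(word >> 3) & 0x3]}{MOV_SRCS[word & 0x7]}"
    elif op == 6:
        mode = "clear" if word & 0x40 else ("wait" if word & 0x20 else "nowait")
        text = f"irq {mode} {irq_index(b)}"
    else:
        text = f"set {SET_DSTS[a]}, {b}"
    return text + (f" [ds={ds}]" if ds else "")


PIO0_PROGRAM = [0x6030, 0xC010, 0x20A0, 0x2020, 0x6081, 0x004A, 0xC050, 0x6060,
                0x6020, 0x2041, 0x00D2, 0x203E, 0x20BE, 0x203E, 0x20BE, 0x4001,
                0x0055, 0x6020, 0x2040, 0x00DB, 0x203F, 0x20BF, 0x4001, 0x005C]

if __name__ == "__main__":
    for addr, word in enumerate(PIO0_PROGRAM, start=8):
        print(f"{addr:2d}: {word:04X}  {disasm(word)}")
```

Run as a script, it lists addresses 8 to 31; for example, address 13 (`0x004A`) decodes to `jmp x--, 10`, a loop back to the `wait 1 pin 0` at address 10.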

-then does a bunch of normal reads/writes with varying patterns

-replaces the global variable pointers in executeRWC with SIO INTERP0_BASE0, INTERP0_BASE1, INTERP0_BASE2
-they don't read any result values from the interpolator, so these act like normal memory locations
-reads the values of those registers (with SWD) to make sure they're being manipulated

It then starts writing PIO programs into memory and executing PIO instructions directly:

RESETS WDSEL <= 0xC00 (bits 10 and 11: PIO0, PIO1) (clear bits)
PSM WDSEL <= 0x4000 (bit 14: SIO)
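
The masks can be turned into block names with the bit orders of the RP2040's RESETS and PSM registers (their WDSEL registers use the same layout as RESET and FRCE_ON). A lookup sketch, with `bit_names` as an illustrative helper:

```python
RESETS_BITS = ["adc", "busctrl", "dma", "i2c0", "i2c1", "io_bank0", "io_qspi",
               "jtag", "pads_bank0", "pads_qspi", "pio0", "pio1", "pll_sys",
               "pll_usb", "pwm", "rtc", "spi0", "spi1", "syscfg", "sysinfo",
               "tbman", "timer", "uart0", "uart1", "usbctrl"]
PSM_BITS = ["rosc", "xosc", "clocks", "resets", "busfabric", "rom", "sram0",
            "sram1", "sram2", "sram3", "sram4", "sram5", "xip",
            "vreg_and_chip_reset", "sio", "proc0", "proc1"]


def bit_names(value: int, names: list) -> list:
    """Names of the blocks whose bits are set in a RESETS/PSM-style mask."""
    return [name for i, name in enumerate(names) if (value >> i) & 1]
```

`bit_names(0xC00, RESETS_BITS)` gives `["pio0", "pio1"]` and `bit_names(0x4000, PSM_BITS)` gives `["sio"]`.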

It then writes a PIO program to PIO1, this time starting at instruction 0 (0x50300048 is INSTR_MEM0, at offset 0x048 from the PIO1 base):

swdwrite 0x50300048 <= 0xE021
swdwrite 0x5030004C <= 0xC023
swdwrite 0x50300050 <= 0xA047
swdwrite 0x50300054 <= 0xC3
swdwrite 0x50300058 <= 0x2020
swdwrite 0x5030005C <= 0x20A0
swdwrite 0x50300060 <= 0x84
swdwrite 0x50300064 <= 0x42
swdwrite 0x50300068 <= 0xC040
swdwrite 0x5030006C <= 0x6020
swdwrite 0x50300070 <= 0x6040
swdwrite 0x50300074 <= 0xC023
swdwrite 0x50300078 <= 0xC020
swdwrite 0x5030007C <= 0x4D
swdwrite 0x50300080 <= 0x108E
swdwrite 0x50300084 <= 0xC021
swdwrite 0x50300088 <= 0xA027
swdwrite 0x5030008C <= 0xD1
swdwrite 0x50300090 <= 0x2020
swdwrite 0x50300094 <= 0x20A0
swdwrite 0x50300098 <= 0x52
swdwrite 0x5030009C <= 0xC043
swdwrite 0x503000A0 <= 0xA027
swdwrite 0x503000A4 <= 0xD7
swdwrite 0x503000A8 <= 0x203F
swdwrite 0x503000AC <= 0x20BF
swdwrite 0x503000B0 <= 0x4001
swdwrite 0x503000B4 <= 0x58
swdwrite 0x503000B8 <= 0xA026
swdwrite 0x503000BC <= 0x8020
swdwrite 0x503000C0 <= 0xB6
swdwrite 0x503000C4 <= 0xC041
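
Mapping the absolute addresses to register names with the RP2040 PIO register map (INSTR_MEM0 to INSTR_MEM31 from offset 0x048, then six registers per state machine from offset 0x0C8) shows where each of these writes lands. A sketch; `pio_reg_name` is an illustrative helper:

```python
PIO0_BASE = 0x50200000
PIO1_BASE = 0x50300000
SM_REGS = ["CLKDIV", "EXECCTRL", "SHIFTCTRL", "ADDR", "INSTR", "PINCTRL"]


def pio_reg_name(addr: int) -> str:
    """Name an RP2040 PIO instruction-memory or state-machine register."""
    for block, base in (("PIO0", PIO0_BASE), ("PIO1", PIO1_BASE)):
        off = addr - base
        if 0x048 <= off < 0x0C8 and off % 4 == 0:
            return f"{block} INSTR_MEM{(off - 0x048) // 4}"
        if 0x0C8 <= off < 0x128 and off % 4 == 0:
            sm, reg = divmod((off - 0x0C8) // 4, len(SM_REGS))
            return f"{block} SM{sm}_{SM_REGS[reg]}"
    return hex(addr)
```

The program writes span `PIO1 INSTR_MEM0` (0x50300048) to `PIO1 INSTR_MEM31` (0x503000C4), and the single-instruction writes that follow go to `SM0_INSTR` (0x503000D8), `SM1_INSTR` (0x503000F0), `SM2_INSTR` (0x50300108) and `SM3_INSTR` (0x50300120).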

PIO1 SM0_EXECCTRL <= 0x1C01FB00
PIO1 SM0_SHIFTCTRL <= 0x80010000
PIO1 SM0_PINCTRL <= 0xE0000
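
The configuration words are easier to read once split into fields. The decoders below follow the SMx_EXECCTRL, SMx_SHIFTCTRL and SMx_PINCTRL layouts in the RP2040 datasheet; the helper names are illustrative, and threshold fields of 0 mean 32.

```python
def field(value: int, hi: int, lo: int) -> int:
    """Extract bits hi..lo (inclusive) of a 32-bit register value."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_execctrl(v: int) -> dict:
    return {"side_en": field(v, 30, 30), "side_pindir": field(v, 29, 29),
            "jmp_pin": field(v, 28, 24), "out_en_sel": field(v, 23, 19),
            "inline_out_en": field(v, 18, 18), "out_sticky": field(v, 17, 17),
            "wrap_top": field(v, 16, 12), "wrap_bottom": field(v, 11, 7),
            "status_sel": field(v, 4, 4), "status_n": field(v, 3, 0)}


def decode_shiftctrl(v: int) -> dict:
    # pull_thresh/push_thresh of 0 mean a threshold of 32 bits
    return {"fjoin_rx": field(v, 31, 31), "fjoin_tx": field(v, 30, 30),
            "pull_thresh": field(v, 29, 25), "push_thresh": field(v, 24, 20),
            "out_shiftdir": field(v, 19, 19), "in_shiftdir": field(v, 18, 18),
            "autopull": field(v, 17, 17), "autopush": field(v, 16, 16)}


def decode_pinctrl(v: int) -> dict:
    return {"sideset_count": field(v, 31, 29), "set_count": field(v, 28, 26),
            "out_count": field(v, 25, 20), "in_base": field(v, 19, 15),
            "sideset_base": field(v, 14, 10), "set_base": field(v, 9, 5),
            "out_base": field(v, 4, 0)}
```

For SM0 this gives a wrap range of 22 to 31, JMP pin 28, a joined RX FIFO with autopush, and IN base 28. The first instruction forced into SM0_INSTR, 0x0016, encodes an unconditional `jmp 22`, i.e. a jump to the bottom of that wrap range.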

then executes the following instructions on PIO1 using the SM0_INSTR register:
swdwrite 0x503000D8 <= 0x16
swdwrite 0x503000D8 <= 0xE042
swdwrite 0x503000D8 <= 0x4044
swdwrite 0x503000D8 <= 0xE04F
swdwrite 0x503000D8 <= 0x4044
swdwrite 0x503000D8 <= 0xA0E6
swdwrite 0x503000D8 <= 0xA0C3
swdwrite 0x503000D8 <= 0xE041
swdwrite 0x503000D8 <= 0x4044
swdwrite 0x503000D8 <= 0xE043
swdwrite 0x503000D8 <= 0x4044
swdwrite 0x503000D8 <= 0xE045
swdwrite 0x503000D8 <= 0x4044
swdwrite 0x503000D8 <= 0xE041
swdwrite 0x503000D8 <= 0x4044
swdwrite 0x503000D8 <= 0xA046

It then enables PIO1 SM0 (sets the SM0 enable bit in the CTRL register at offset 0x000).


PIO1 SM2_EXECCTRL <= 0x1D015780
PIO1 SM2_SHIFTCTRL <= 0x90000
PIO1 SM2_PINCTRL <= 0xD8000

it then executes the following instructions on PIO1 using the SM2_INSTR register:
swdwrite 0x50300108 <= 0xE042
swdwrite 0x50300108 <= 0x4044
swdwrite 0x50300108 <= 0xE041
swdwrite 0x50300108 <= 0x4044
swdwrite 0x50300108 <= 0xE040
swdwrite 0x50300108 <= 0x4044
swdwrite 0x50300108 <= 0xA0E6


PIO1 SM1_PINCTRL <= 0x40001E0
it then executes an instruction on PIO1 using the SM1_INSTR register:
swdwrite 0x503000F0 <= 0xE09F

PIO1 SM1_EXECCTRL <= 0xE480
PIO1 SM1_SHIFTCTRL <= 0x60000
PIO1 SM1_PINCTRL <= 0x20003C00

PIO1 SM3_EXECCTRL <= 0x1C008000
PIO1 SM3_SHIFTCTRL <= 0x90000
PIO1 SM3_PINCTRL <= 0xD8000

it then executes the following instructions on PIO1 using the SM3_INSTR register:
swdwrite 0x50300120 <= 0xE042
swdwrite 0x50300120 <= 0x4044
swdwrite 0x50300120 <= 0xE04F
swdwrite 0x50300120 <= 0x4044
swdwrite 0x50300120 <= 0xA0E6

Finally, along with some other innocuous SRAM writes, it jumps to the binary blob by overwriting a return address on the stack, ending the decryption routine.

I hope this information is helpful to someone, and if you want to know more, please don't hesitate to DM me. I've spent a long time working on this, and I would be very glad to know it helped someone.
 

binkinator

I'm pleased to announce that after significant time working on a cycle-accurate (this is very important) emulator I've finally been able to go past the decryption phase and have dumped the segments of ARM code that is written by the end of the encryption. I have to say, this is the most fun CTF I've ever done, although at some point it ran out of steam and couldn't surprise me that much.

My dump isn't perfect--the code it jumps to itself is a tiny bit obfuscated (as in, it copies code to other locations and jumps there to fool IDA's autoanalysis) but as far as I'm concerned the hard part is over. I even have the PIO code(!!!) it writes and executes on the PIO1 state machines.

This is largely a follow-up of my previous post, and I don't want to duplicate information, so if you're confused I'd recommend reading that one first.

Also, big thanks to the people who sent me their firmwares; without them, none of this would have been possible. If you did send me a firmware and would like others to be able to look at what gets dumped, please tell me so I can do that. I can also release my emulator if it helps someone.. it's just 2000 lines of hastily written Zig code, although you will have to manually find some patch addresses so that it works properly.

I'll split this up into sections to avoid spam.

First, and most unimportant, the mysterious SWD message it sends at the very start just wakes it from a dormant state, so not very interesting.

Then there is the decryption. After initializing some data structures (I called them "wordbank0", "wordbank1", and "constant_random_data_waste_of_time") it sets the VTOR to an initial value (e.g. EE2F8D10, which gets truncated to EE2F8D00) and then goes 16 bytes at a time (we'll call this a block) on the binary blob at the base of the SRAM. It also takes the 8-byte board ID and copies it to a 256 byte structure which I call the flash XOR buffer.

In a block, for each byte, it will first derive a key based on the value stored in the VTOR as well as the flash xor buff and a rolling index into it. This value, along with the previous value of the process stack pointer (PSP/SP_Process/whatever ARM calls it) is then written to the current PSP.

This means that some part of the encryption relies on the *UNINITIALIZED* value of the PSP, which is 0xFFFFFFFC, in case you were having troubles with your emulation.

The key is manipulated some more after that. Interestingly, it then sets the flash XOR buff at the selected index to the value of the current encrypted byte. Finally, it XORs the encrypted byte by the key and writes it.

At the end of a block, after all 16 bytes have been written, it takes the PSP shifted right by 8 bits and XORs it by the decrypted byte at byte 15 (that is, the last decrypted byte in the block) and based on if bits 0 through n (n seems to vary across separate firmwares) are set, in a loop of up to n times (the same n!) it calls a function we'll call readWriteOrCall. This will be looked at later.

Finally, it writes a new PSP. All of this can be seen in publicly available firmwares (in this thread) so I don't want to bother going into specifics; it's not that complicated.

We then have readWriteOrCall, which, based on the input (the value of bit n from (PSP >> 8) ^ last_encrypted_byte ^ key) manipulates the previously mentioned word banks (wordbank0/wordbank1) with some division, multiplication, shifts, etc (it sounds complicated but it's simple enough to just F5 and replicate in IDA) until it eventually maybe decides to call the most important function, which I just called executeRWC.

executeRWC is very funny because in the middle of the controlflow graph there is an innocuous branch that loads a value into R0 and then jumps to it. For a while I thought this was where it jumps inside the encrypted blob, but that is wrong. In fact, it never takes that branch. Go figure! Like I said, this was the most interesting CTF I've ever done.

executeRWC is also very important. It has two other (used) features: that it can arbitrarily read or write memory via sequences to a core's SWD. The action taken, the addresses and data used, and which processor to do it on (this is also important) are vaguely derived from the value given to it, which is derived from how the word banks are manipulated, which is derived from the bit setting of the decrypted byte and the PSP, which is certainly a mouthful.

As expected, the writes are mostly used for anti-debug. After initializing the systick (!!!) it spams writes to 0x4001C080 with the value 0x80 -- this is the pad ctrl disable bit for the external SWD pin, which is why it's impossible to debug the rp2040 while it's decrypting the blob. They also periodically read from this register to make sure it's still 0x80.

There are other things it does with the SWD writes, but that's also for later.

SWD reads will read the value and then manipulate the VTOR. Yes, the very same VTOR that is used to derive the encryption key and modify the PSP... meaning that it's essentially a check to see if a memory address contains an expected value.

SWD writes are always done by processor 0, but SWD reads can alternate between 0 and 1 (where 0 is the core executing this code and 1 is the secondary core). However, they are only done by processor 1 when reading 0xE000101C. What is that peripheral, one might ask? It's on the SCS page, but the RP2040 datasheet does not document it. It turns out, of course, that it's documented by the ARMv6-M ARM to be a register holding a the address of a "recently executed" (they deliberately don't define this) instruction.

Because the read is done by core 1, it means they are essentially checking if core 1 is halting at a WFE instruction in the RP2040's bootrom. In other words, they're checking to see if there is any code running on the other core--if you were trying to run e.g. debugging routines on that core but the decryption was failing, this is why. They do this read fairly often, and the value they expect is either 0x180 or 0x174, depending on the bootrom version.

Beyond this they also mostly read the VTOR (self-explanatory) and other SWD comparator registers (which must return 0xFFFFFFFF)... and the systick.

Normally, emulating the systick would be easy, but because it's being read and written by the SWD protocol, we need to keep in mind not only which SWD bit write causes the memory operation (it's the first read done by the 16 bit "turnaround" right after SWCLK is forced high) but also *when* the protocol is permitted to access memory. This is because it must go through the processor core, which can only read/write one address at a time.

After much pain and experimentation I found that both reads and writes dispatch exactly 4 cycles after the instruction that forces SWCLK high.. unless the processor is accessing memory. This means that if an instruction is aligned to 4 bytes, or accesses memory a bunch of times (like POP or LDM), or is 4 bytes wide (like MSR/MRS/BL), the SWD operation will be delayed.

For further example, If an instruction is NOT aligned to 4 bytes and accesses an AHB-lite address, which normally takes 2 cycles, the first cycle will be used to perform the access, and the processor stalls on the second cycle, where the SWD operation can take place. However, if the instruction IS aligned to 4 bytes the first cycle is spent fetching the word it sits on, and the next cycle is used to perform the access, so there is no room and it has to wait for the next instruction. The actual logic is more involved (like with how it interacts with 4 byte instructions and instructions that reference SIO memory) but the bottom line is that getting something to work accurately is not impossible.

..once all of that is implemented, and your emulator properly counts cycles, it's almost smooth sailing from then on. It will keep reading and writing the peripherals mentioned above until a certain point.

Now for the fun part: I imagine the author of the firmware realized that patching the board ID in code was too trivial. To mitigate this, of course, they just.. read it with SWD operations. This requires them to start using PLL_REF, for some reason, (I noticed while replicating this on my Pico that if I didn't use CLK_REF it would freeze.. not sure why) but after that they do the standard reads/writes to 0x18000060 with the message 0x4B.

Obviously an emulator can just see these reads and writes and just return the flash ID, but someone running this on an unintended system will obviously run into trouble.. you could patch the SWD read/write routines and restore the systick, each time, though.

During this, I assume the SysTick is somewhat unreliable, so they repoint a conspicuous global pointer variable at the XOSC COUNT register (which just happens to now run at the same frequency as the sys clk because it uses the REF PLL). The code then accesses it in a sequence like:

Code:
LDR  Rx, [0x.......] ; pointer gets changed from an arbitrary byte address to the XOSC COUNT address
...
STRB Rn, [Rx]        ; in the loop, stores part of the VTOR to that address; this is the
                     ; first access each loop, which explains why they did this
...
LDRB Rm, [Rx]        ; first read: for a normal byte address, returns the same value;
                     ; for the XOSC COUNT register, it will have decremented a bit
...
LDRB Rm, [Rx]        ; ditto

This means that in between reads you have to emulate the cycle differences. I found that an offset of about +3 cycles from the write instruction worked, although at that point I had already counted a 4 cycle delay from writing to that peripheral. Otherwise, it's the same clock-counting idea as with the SysTick, but on a much smaller scale, and it's easy to check that you're doing it right by running the same thing on actual hardware.
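A small model of that register might look like the following. It assumes, as described above, that COUNT decrements once per core cycle while the clocks are configured this way, that it stops at zero, and that the +3 offset is the delay between the write instruction and the first decrement (that last interpretation is an assumption). The class name is hypothetical.

```python
class XoscCountModel:
    """8-bit down-counter that decrements once per core cycle and stops at zero."""

    def __init__(self, write_offset: int = 3):
        self.write_offset = write_offset  # empirically found delay, in cycles
        self.value = 0
        self.start_cycle = 0

    def write(self, value: int, cycle: int) -> None:
        self.value = value & 0xFF  # COUNT is an 8-bit field, matching the STRB
        self.start_cycle = cycle + self.write_offset

    def read(self, cycle: int) -> int:
        elapsed = max(0, cycle - self.start_cycle)
        return max(0, self.value - elapsed)
```

For example, after writing 0x80 on cycle 100, a read on cycle 103 still returns 0x80 and a read on cycle 110 returns 0x79.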

After verifying the board ID, they reset the sys clk back to what it was and stop using the XOSC COUNT register in the loop.

Doing more of the same typical reads/writes, they eventually read the SysTick again, which initially threw me off because my value was wrong. Doing exactly the same thing on my Pico revealed that the flash accesses add a total delay of 52 cycles, which, thankfully, also worked here.

Finally, using the SWD interface they will write and execute PIO(!!) programs. The following only omits the typical reads and writes as well as specific writes to SRAM.

-GPIO16 funcsel <= NULL
-writes to PIO0 instr mem starting at 8:

0x6030
0xC010
0x20A0
0x2020
0x6081
0x004A
0xC050
0x6060
0x6020
0x2041
0x00D2
0x203E
0x20BE
0x203E
0x20BE
0x4001
0x0055
0x6020
0x2040
0x00DB
0x203F
0x20BF
0x4001
0x005C
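To read listings like this one, a small disassembler for the RP2040 PIO encoding helps. This sketch treats bits 12:8 purely as delay, i.e. it assumes no side-set is configured; when SIDESET_COUNT is non-zero (as for PIO1 SM1 further down) the top bits of that field are side-set values instead. The function and table names are hypothetical.

```python
JMP_COND = ["", "!x ", "x-- ", "!y ", "y-- ", "x!=y ", "pin ", "!osre "]
WAIT_SRC = ["gpio", "pin", "irq", "?"]
IN_SRC = ["pins", "x", "y", "null", "?", "?", "isr", "osr"]
OUT_DST = ["pins", "x", "y", "null", "pindirs", "pc", "isr", "exec"]
MOV_DST = ["pins", "x", "y", "?", "exec", "pc", "isr", "osr"]
MOV_SRC = ["pins", "x", "y", "null", "?", "status", "isr", "osr"]
MOV_OP = ["", "!", "::", "?"]
SET_DST = ["pins", "x", "y", "?", "pindirs", "?", "?", "?"]


def disasm(instr: int) -> str:
    op = (instr >> 13) & 0x7
    delay = (instr >> 8) & 0x1F   # assumes no side-set bits are configured
    a = (instr >> 5) & 0x7        # bits 7:5
    b = instr & 0x1F              # bits 4:0
    count = b or 32               # IN/OUT bit count 0 encodes 32
    if op == 0:
        text = f"jmp {JMP_COND[a]}{b}"
    elif op == 1:
        pol = (instr >> 7) & 1
        text = f"wait {pol} {WAIT_SRC[(instr >> 5) & 0x3]} {b}"
    elif op == 2:
        text = f"in {IN_SRC[a]}, {count}"
    elif op == 3:
        text = f"out {OUT_DST[a]}, {count}"
    elif op == 4:
        is_pull = (instr >> 7) & 1
        flag = (instr >> 6) & 1
        block = "block" if (instr >> 5) & 1 else "noblock"
        cond = ("ifempty " if is_pull else "iffull ") if flag else ""
        text = f"{'pull' if is_pull else 'push'} {cond}{block}"
    elif op == 5:
        text = f"mov {MOV_DST[a]}, {MOV_OP[(instr >> 3) & 0x3]}{MOV_SRC[instr & 0x7]}"
    elif op == 6:
        clr, wait = (instr >> 6) & 1, (instr >> 5) & 1
        mode = "clear" if clr else ("wait" if wait else "nowait")
        rel = " rel" if b & 0x10 else ""
        text = f"irq {mode} {b & 0x7}{rel}"
    else:
        text = f"set {SET_DST[a]}, {b}"
    return text + (f" [{delay}]" if delay else "")
```

Run over the first six words above, it gives: out x, 16 / irq nowait 0 rel / wait 1 pin 0 / wait 0 pin 0 / out pindirs, 1 / jmp x-- 10.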

-then does a bunch of normal reads/writes with varying patterns

-replaces globalvar pointers in executeRWC with SIO INTERP0_BASE0, INTERP0_BASE1, INTERP0_BASE2
-they don't read any result values from this, so they act like normal memory locations
-reads the value of the registers (with SWD) to make sure they're being manipulated

it then starts writing PIO programs into memory and executing PIO instructions directly:

RESETS WDSEL <= 0xC00 (bits 10, 11: PIO0, PIO1) (clear bits)
PSM WDSEL <= 0x4000 (bit 14: SIO)
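Decoding those masks is just a bit-to-name lookup. The tables below are deliberately partial, covering only bits around the ones used here as laid out in the RP2040 datasheet; the helper name is hypothetical.

```python
# Partial bit maps (only the entries relevant here), per the RP2040 datasheet.
RESETS_BITS = {10: "pio0", 11: "pio1", 12: "pll_sys", 13: "pll_usb"}
PSM_BITS = {12: "xip", 14: "sio", 15: "proc0", 16: "proc1"}


def bit_names(mask: int, table: dict) -> list:
    """Names of the set bits in `mask`, falling back to 'bitN' for unmapped bits."""
    return [table.get(bit, f"bit{bit}") for bit in range(32) if (mask >> bit) & 1]
```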

it then writes a PIO program to PIO1, this time starting at instruction 0 (0x50300048 is INSTR_MEM0) and filling all 32 slots:

swdwrite 0x50300048 <= 0xE021
swdwrite 0x5030004C <= 0xC023
swdwrite 0x50300050 <= 0xA047
swdwrite 0x50300054 <= 0xC3
swdwrite 0x50300058 <= 0x2020
swdwrite 0x5030005C <= 0x20A0
swdwrite 0x50300060 <= 0x84
swdwrite 0x50300064 <= 0x42
swdwrite 0x50300068 <= 0xC040
swdwrite 0x5030006C <= 0x6020
swdwrite 0x50300070 <= 0x6040
swdwrite 0x50300074 <= 0xC023
swdwrite 0x50300078 <= 0xC020
swdwrite 0x5030007C <= 0x4D
swdwrite 0x50300080 <= 0x108E
swdwrite 0x50300084 <= 0xC021
swdwrite 0x50300088 <= 0xA027
swdwrite 0x5030008C <= 0xD1
swdwrite 0x50300090 <= 0x2020
swdwrite 0x50300094 <= 0x20A0
swdwrite 0x50300098 <= 0x52
swdwrite 0x5030009C <= 0xC043
swdwrite 0x503000A0 <= 0xA027
swdwrite 0x503000A4 <= 0xD7
swdwrite 0x503000A8 <= 0x203F
swdwrite 0x503000AC <= 0x20BF
swdwrite 0x503000B0 <= 0x4001
swdwrite 0x503000B4 <= 0x58
swdwrite 0x503000B8 <= 0xA026
swdwrite 0x503000BC <= 0x8020
swdwrite 0x503000C0 <= 0xB6
swdwrite 0x503000C4 <= 0xC041
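The slot each of these writes targets follows directly from the register layout: INSTR_MEM0 sits at offset 0x048 in each PIO block and the 32 slots are consecutive words. A sketch, with a hypothetical helper name:

```python
PIO1_BASE = 0x50300000
INSTR_MEM0 = 0x048  # offset of INSTR_MEM0; INSTR_MEM31 is at 0x0C4


def instr_slot(addr: int, base: int = PIO1_BASE) -> int:
    """Map a PIO instruction-memory register address to its slot index (0-31)."""
    offset = addr - base - INSTR_MEM0
    if offset < 0 or offset % 4 or offset // 4 > 31:
        raise ValueError(f"{addr:#x} is not a PIO instruction-memory register")
    return offset // 4
```

So 0x50300048 is slot 0 and 0x503000C4 is slot 31, while 0x503000D8 (SM0_INSTR) is rejected.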

PIO1 SM0_EXECCTRL <= 0x1C01FB00
PIO1 SM0_SHIFTCTRL <= 0x80010000
PIO1 SM0_PINCTRL <= 0xE0000
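These configuration words are easier to follow once split into fields. The bit positions below follow the RP2040 datasheet for SMx_EXECCTRL, SMx_SHIFTCTRL and SMx_PINCTRL (a subset of EXECCTRL fields is shown); the helper names are hypothetical.

```python
def bits(value: int, hi: int, lo: int) -> int:
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_execctrl(v: int) -> dict:
    return {"side_en": bits(v, 30, 30), "jmp_pin": bits(v, 28, 24),
            "out_sticky": bits(v, 17, 17), "wrap_top": bits(v, 16, 12),
            "wrap_bottom": bits(v, 11, 7)}


def decode_shiftctrl(v: int) -> dict:
    return {"fjoin_rx": bits(v, 31, 31), "fjoin_tx": bits(v, 30, 30),
            "pull_thresh": bits(v, 29, 25), "push_thresh": bits(v, 24, 20),
            "out_shift_right": bits(v, 19, 19), "in_shift_right": bits(v, 18, 18),
            "autopull": bits(v, 17, 17), "autopush": bits(v, 16, 16)}


def decode_pinctrl(v: int) -> dict:
    return {"sideset_count": bits(v, 31, 29), "set_count": bits(v, 28, 26),
            "out_count": bits(v, 25, 20), "in_base": bits(v, 19, 15),
            "sideset_base": bits(v, 14, 10), "set_base": bits(v, 9, 5),
            "out_base": bits(v, 4, 0)}
```

For SM0 this gives JMP_PIN = 28, a wrap range of 22..31, IN_BASE = 28, a joined RX FIFO with autopush, and IN shifting left. The wrap bottom of 22 lines up with the 0x16 (jmp 22) that is forced into SM0_INSTR next.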

it then executes the following instructions on PIO1 using the SM0_INSTR register:
swdwrite 0x503000D8 <= 0x16
swdwrite 0x503000D8 <= 0xE042
swdwrite 0x503000D8 <= 0x4044
swdwrite 0x503000D8 <= 0xE04F
swdwrite 0x503000D8 <= 0x4044
swdwrite 0x503000D8 <= 0xA0E6
swdwrite 0x503000D8 <= 0xA0C3
swdwrite 0x503000D8 <= 0xE041
swdwrite 0x503000D8 <= 0x4044
swdwrite 0x503000D8 <= 0xE043
swdwrite 0x503000D8 <= 0x4044
swdwrite 0x503000D8 <= 0xE045
swdwrite 0x503000D8 <= 0x4044
swdwrite 0x503000D8 <= 0xE041
swdwrite 0x503000D8 <= 0x4044
swdwrite 0x503000D8 <= 0xA046
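Apart from the initial jmp, these forced instructions seem to assemble constants 4 bits at a time with "set y" and "in y, 4". A minimal executor for just that subset shows the result, assuming ISR starts at 0 and IN shifts left, as SHIFTCTRL = 0x80010000 (IN_SHIFTDIR = 0) indicates. The function name is hypothetical, and autopush is ignored because only 8 or 16 bits are shifted in here.

```python
def run_forced(instrs, in_shift_right=False):
    """Apply PIO instructions written to SMx_INSTR, one after another.

    Supports only unconditional JMP (as a PC write), SET x/y, IN x/y/null and
    plain MOV between x, y, isr, osr (or from null).
    """
    sm = {"pc": 0, "x": 0, "y": 0, "isr": 0, "osr": 0}
    regs = {1: "x", 2: "y", 6: "isr", 7: "osr"}
    for instr in instrs:
        op = (instr >> 13) & 0x7
        a = (instr >> 5) & 0x7            # bits 7:5: condition / source / destination
        b = instr & 0x1F                  # bits 4:0: address / bit count / data
        if op == 0 and a == 0:            # jmp <addr>
            sm["pc"] = b
        elif op == 7 and a in (1, 2):     # set x|y, <data>
            sm[regs[a]] = b
        elif op == 2 and a in (1, 2, 3):  # in x|y|null, <count>
            n = b or 32
            data = 0 if a == 3 else sm[regs[a]] & ((1 << n) - 1)
            if in_shift_right:
                sm["isr"] = (sm["isr"] >> n) | (data << (32 - n))
            else:
                sm["isr"] = ((sm["isr"] << n) | data) & 0xFFFFFFFF
        elif op == 5 and ((instr >> 3) & 0x3) == 0 and a in regs:
            src = instr & 0x7
            if src != 3 and src not in regs:
                raise NotImplementedError(hex(instr))
            sm[regs[a]] = 0 if src == 3 else sm[regs[src]]
        else:
            raise NotImplementedError(hex(instr))
    return sm
```

Fed the 16 words above, this leaves PC = 22, OSR = 0x2F and Y = 0x1351. The same idea applied to the SM2 sequence further down leaves OSR = 0x210.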

it then enables PIO1 SM0 (sets SM_ENABLE bit 0 in the CTRL register at offset 0x000).


PIO1 SM2_EXECCTRL <= 0x1D015780
PIO1 SM2_SHIFTCTRL <= 0x90000
PIO1 SM2_PINCTRL <= 0xD8000

it then executes the following instructions on PIO1 using the SM2_INSTR register:
swdwrite 0x50300108 <= 0xE042
swdwrite 0x50300108 <= 0x4044
swdwrite 0x50300108 <= 0xE041
swdwrite 0x50300108 <= 0x4044
swdwrite 0x50300108 <= 0xE040
swdwrite 0x50300108 <= 0x4044
swdwrite 0x50300108 <= 0xA0E6


PIO1 SM1_PINCTRL <= 0x40001E0
it then executes an instruction on PIO1 using the SM1_INSTR register:
swdwrite 0x503000F0 <= 0xE09F
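0xE09F decodes to "set pindirs, 31", and which pins it touches depends on the PINCTRL value written just before it (SET_BASE in bits 9:5, SET_COUNT in bits 28:26, pin numbers wrapping modulo 32). A sketch with a hypothetical helper name:

```python
def set_pin_targets(pinctrl: int, data: int) -> dict:
    """Map a SET instruction's data bits to GPIO numbers using SMx_PINCTRL."""
    set_base = (pinctrl >> 5) & 0x1F
    set_count = (pinctrl >> 26) & 0x7
    return {(set_base + i) % 32: (data >> i) & 1 for i in range(set_count)}
```

With SM1_PINCTRL = 0x40001E0 (SET_BASE = 15, SET_COUNT = 1) only GPIO15 is affected, so this makes GPIO15 an output. The later SM1_PINCTRL value 0x20003C00 then puts a single side-set pin on GPIO15.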

PIO1 SM1_EXECCTRL <= 0xE480
PIO1 SM1_SHIFTCTRL <= 0x60000
PIO1 SM1_PINCTRL <= 0x20003C00

PIO1 SM3_EXECCTRL <= 0x1C008000
PIO1 SM3_SHIFTCTRL <= 0x90000
PIO1 SM3_PINCTRL <= 0xD8000

it then executes the following instructions on PIO1 using the SM3_INSTR register:
swdwrite 0x50300120 <= 0xE042
swdwrite 0x50300120 <= 0x4044
swdwrite 0x50300120 <= 0xE04F
swdwrite 0x50300120 <= 0x4044
swdwrite 0x50300120 <= 0xA0E6

Finally, along with some other innocuous SRAM writes, it jumps to the binary blob by overwriting a return address on the stack, ending the encryption algorithm.

I hope this information was helpful to someone, and if you want to know more, please do not hesitate to DM me. I've spent a long time working on this and I would be very glad to know it helped someone.
This is amazing work! Just wow! Thank you.

please publish the emulator. I want to learn more.
 