Again, "Erase unit alignment" is the magic sauce here.
I will try to explain why:
Internally, the SD card can only write at the granularity of the erase unit. NOTHING SMALLER. To do a smaller write, it first reads the entire erase unit, applies the smaller write to the data it just read, then writes the entire erase unit back to the flash. This is the expensive "read, modify, write" cycle I mentioned earlier.
The erase unit size can be 64 KB or larger on large-capacity cards, yet most file systems expect a cluster/allocation unit size of around 4 KB at most. That means a "sequential write" on a card with a 64 KB erase unit, using a file system with 4 KB clusters, will do SIXTEEN read-modify-write cycles in a row (64 / 4 = 16) to update the full erase unit, when it COULD just do ONE write operation if the alignment were correct.
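To put numbers on that, here is a quick back-of-the-envelope sketch in Python. It only models the simplified picture described above (any write smaller than the erase unit costs one full read-modify-write); the function name and the sizes are my own illustration, not something a real card reports.

```python
KIB = 1024

def rmw_cycles(erase_unit: int, write_size: int, total_bytes: int) -> int:
    """Count read-modify-write cycles for a sequential write.

    Assumption (simplified model): every write smaller than the erase unit
    triggers one full read-modify-write; a whole aligned unit triggers none.
    """
    if write_size >= erase_unit:
        return 0
    # One cycle per small write, using ceiling division.
    return -(-total_bytes // write_size)

# Filling one 64 KiB erase unit with 4 KiB clusters vs. one 64 KiB write.
print(rmw_cycles(64 * KIB, 4 * KIB, 64 * KIB))   # 16
print(rmw_cycles(64 * KIB, 64 * KIB, 64 * KIB))  # 0
```

Real controllers cache and merge writes, so treat the 16 as a worst case for this model rather than a measurement.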
Additionally, read speed is also impacted, because the page size (the amount the card's microcontroller natively reads per read operation) is some fraction of the erase unit size, if not exactly the erase unit size. Meaning, when you request a 4 KB chunk of data, it actually reads a full page (up to the whole 64 KB erase unit), throws away the part you did not ask for, and returns the rest. You can skip the trimming step completely by requesting whole pages, which you get by setting the cluster size properly.
Why is "dead-nuts alignment" needed?
Imagine what happens when your requested sectors do not sit exactly on erase unit boundaries: the card has to read two units to get your data, then cut and paste to assemble what you asked for, and when you do a write, it writes twice, to two erase units. AT BEST.
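The straddling problem is easy to see with a little arithmetic. This is a rough Python sketch, assuming a 64 KiB erase unit; the helper name is made up for illustration.

```python
KIB = 1024

def erase_units_touched(offset: int, length: int, erase_unit: int = 64 * KIB) -> list:
    """Return the indices of the erase units that a byte range overlaps."""
    if length <= 0:
        return []
    first = offset // erase_unit
    last = (offset + length - 1) // erase_unit
    return list(range(first, last + 1))

# A 64 KiB request that starts on a boundary stays inside one unit.
print(erase_units_touched(0, 64 * KIB))    # [0]
# The same request shifted by one 512-byte sector straddles two units.
print(erase_units_touched(512, 64 * KIB))  # [0, 1]
```

Being off by a single sector is enough to double the number of erase units involved in every full-size request.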
If you reformat your SD Card, be very sure to do it in a way that properly respects the erase unit size, if you want maximum performance from the card.
Again, I learned this lesson the hard and painful way on my Linux-running Chromebook. I have since done some clever hackery to make the ext4 file system produce dead-nuts aligned 64 KB data chunks (even though the max block size for ext4 is 4 KB) by abusing its RAID functionality, and performance went WAAAAAAAY up.
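For anyone wanting to try something similar, here is a sketch of the arithmetic. I am assuming the "RAID functionality" means the mke2fs extended options stride and stripe_width, which are given in file system blocks; the exact options used on the Chromebook are not spelled out here, and /dev/sdX1 is only a placeholder.

```python
KIB = 1024

def ext4_raid_hints(erase_unit: int = 64 * KIB, block_size: int = 4 * KIB) -> dict:
    """Express one erase unit as a count of ext4 blocks.

    Assumption: the card is treated like a one-disk "array" whose chunk
    size equals one erase unit, so stride and stripe_width are equal.
    """
    if erase_unit % block_size:
        raise ValueError("erase unit must be a multiple of the block size")
    blocks = erase_unit // block_size
    return {"stride": blocks, "stripe_width": blocks}

hints = ext4_raid_hints()
opts = ",".join(f"{name}={value}" for name, value in hints.items())
# Placeholder device name: double-check the target before running mkfs.
print(f"mkfs.ext4 -b 4096 -E {opts} /dev/sdX1")
```

With a 64 KiB erase unit and 4 KiB blocks both values come out to 16. These options are only hints to the allocator, so the partition itself still has to start on an erase unit boundary.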
exFAT natively supports such huge allocation unit sizes. Pick one that is appropriate for your card, and ensure that your partition starts exactly at the start of an erase unit.
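Checking the partition start is simple arithmetic. A minimal Python sketch, assuming the partition table counts in 512-byte sectors and that you already know your card's erase unit size; the function names are hypothetical.

```python
KIB = 1024
SECTOR = 512  # assumption: start offsets are given in 512-byte sectors

def is_aligned(start_sector: int, erase_unit: int = 64 * KIB) -> bool:
    """True if the partition's first byte sits on an erase unit boundary."""
    return (start_sector * SECTOR) % erase_unit == 0

def next_aligned_sector(start_sector: int, erase_unit: int = 64 * KIB) -> int:
    """Round a start sector up to the next erase unit boundary."""
    per_unit = erase_unit // SECTOR
    return -(-start_sector // per_unit) * per_unit

print(is_aligned(63))           # False: the old DOS-style start at sector 63
print(is_aligned(2048))         # True: a 1 MiB start, with 64 KiB units
print(next_aligned_sector(63))  # 128
```

Note that the common 1 MiB start (sector 2048) passes for a 64 KiB erase unit but would fail for a card with a 4 MiB erase unit, so plug in the real size for your card.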
You *WILL NOT* get ideal performance from your card unless you do.