Compressed text from dialogs. This is where the problems are. The game's script is divided in more than a thousand "files" (there's no file system, they are just in the middle of the rom). Each one contains the dialog from a part of the game. Each one of the "files" is compressed with LZ77 and is located at the addresses you can see in the File column at the translation site. I got all these addresses from the script dump Ritchburn made years ago, but I was also able to find them and dump them on my own by looking for LZ77 compression headers in the ROM. Each one of the files contains a bunch of dialog lines surrounded by code the game interprets (I think this contains which character says each line and the expression they make, style of the characters, etc). Each actual dialog line starts with the bytes 0x08 0x03 and ends with 0x00 0x00, except for the first line of each file that doesn't have anything special to indicate the beginning. You can try extracting files with GBADecmp (other compressors/decompressors can't compress it back to its original size so they cause problems). The address of the first dialog file is 017bcb2c.
Oh boy. That's one of the more annoying ways to store it, though fairly efficient. Yeah, that's almost certainly neither pointer- nor length-based, aside from the LZ77 block's length and location itself - that 2-byte 0 at the end of each line is probably the null terminator, and it's most likely a variable-length, opcode-based scripting system, in which 0x08 0x03 is presumably the command to print null-terminated text, with the implication that it ends at the 0x00 0x00. The good news is you probably won't have to deal with pointer math much except for relocating entire LZ77 blocks - you'll need to find the pointers to the data blocks, and if a data block needs to be moved to grow, move the block and change the pointer. What the game presumably does to actually use the data is decompress the entire block, parse through it 1 or 2 bytes at a time, and do whatever the opcode it found dictates (which in the case of text is print everything up to the next 0x00 0x00 as dialogue). The remaining question there is - is there only one entry point per 'file', or can there be multiple? Given the 'more than a thousand files' part, probably the first option, and that's gonna be a lot easier to deal with if so - if it's the latter, there may be a pointer table somewhere to choose where to start processing opcodes. Another potential sticky bit is if they use jumps instead of skips for choices/etc.: in that case there are probably offsets stored right after each jump opcode, and you'd need to update those if moving things. If it's simply 'skip x commands' or 'switch to dialogue file x', though, that won't care if you change the length of a command.
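To make the "print opcode plus terminator" guess concrete, here's a rough Python sketch that pulls out every run of bytes sitting between a 0x08 0x03 and the next 0x00 0x00 in an already-decompressed block. The function name is made up for illustration, and the scan is deliberately naive: it assumes 0x08 0x03 never shows up inside other opcodes' arguments and that the terminator can be found with a plain byte search, neither of which is confirmed for this game.

```python
def extract_dialog_lines(data: bytes):
    """Return (offset, raw_text_bytes) for each run between 0x08 0x03 and 0x00 0x00.

    Hypothetical helper: a plain byte scan, not a real opcode parser.
    """
    start_marker = b"\x08\x03"   # presumed "print text" opcode
    end_marker = b"\x00\x00"     # presumed 2-byte null terminator
    lines = []
    pos = 0
    while True:
        start = data.find(start_marker, pos)
        if start < 0:
            break
        text_start = start + len(start_marker)
        end = data.find(end_marker, text_start)
        if end < 0:
            break  # unterminated run; stop rather than guess
        lines.append((text_start, data[text_start:end]))
        pos = end + len(end_marker)
    return lines


if __name__ == "__main__":
    sample = (b"\x01\x02" + b"\x08\x03" + "テスト".encode("shift_jis") + b"\x00\x00"
              + b"\x05" + b"\x08\x03" + b"\x82\x60" + b"\x00\x00")
    for offset, raw in extract_dialog_lines(sample):
        print(hex(offset), raw.decode("shift_jis"))
```

If lines come out truncated or merged, that would be a hint that the real format is stricter than this scan assumes (for example, terminators aligned to 2-byte units).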
According to http://en.wikipedia.org/wiki/Shift_JIS both upper- and lower-case roman characters should only take 1 byte each, but I've seen claims otherwise regarding Hajimari's encoding - what's up with that? Are they storing even those as 2 bytes in some abnormal way?
Is each 'file' LZ77'd as a single block, or only the text portions within it? (e.g. are the 0x08 0x03 opcodes compressed as well?) LZ77 being what it is, changing the content while keeping the same uncompressed length can still easily result in a different compressed length, no matter what you do, since it's dictionary-based and the results vary based on the degree of nearby self-similarity; something like "testtesttesttest" should compress much better than "This is a test."
The best way to go about this, albeit a fairly complicated one, is: find both the pointers to and the lengths of the LZ77 blocks. Hopefully the pointers are in an array somewhere, and the lengths are either next to them or part of the LZ77 header itself; poke around the bytes immediately before each block, and make sure you know what the contents of the header are (is the length there, or is it purely a signature and dictionary?). Then, when you recompress a block, if it's the same length or shorter, you can just write it back in place (updating the length if one is stored), and if it's longer, you need to move the block to free space at the end of the ROM and update its pointer. The first step here is probably going to be making sure you know exactly where each block starts, since looking for a pointer value that's e.g. 4 bytes later than it should be would make it vastly harder. Did you try VisualBoyAdvance-1.8.0-beta3's debugger?
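For the pointer hunt, something along these lines might be a starting point. It's only a sketch: it assumes the game stores plain 32-bit little-endian absolute addresses, that they're word-aligned, and that the ROM is addressed from 0x08000000 (the normal GBA cartridge mapping). The function names are hypothetical, and if the game stores relative offsets or packs the pointers some other way this won't find anything.

```python
import struct

GBA_ROM_BASE = 0x08000000  # standard GBA cartridge ROM mapping


def find_pointers(rom: bytes, file_offset: int, base: int = GBA_ROM_BASE):
    """List word-aligned ROM offsets holding a 32-bit LE pointer to file_offset."""
    needle = struct.pack("<I", base + file_offset)
    hits = []
    pos = rom.find(needle)
    while pos >= 0:
        if pos % 4 == 0:  # assumption: pointers are 4-byte aligned
            hits.append(pos)
        pos = rom.find(needle, pos + 1)
    return hits


def repoint(rom: bytearray, pointer_offsets, new_file_offset: int,
            base: int = GBA_ROM_BASE):
    """Overwrite each found pointer so it targets the relocated block."""
    for off in pointer_offsets:
        struct.pack_into("<I", rom, off, base + new_file_offset)


if __name__ == "__main__":
    rom = bytearray(64)
    struct.pack_into("<I", rom, 8, GBA_ROM_BASE + 0x20)  # fake pointer for the demo
    hits = find_pointers(bytes(rom), 0x20)
    repoint(rom, hits, 0x30)
    print(hits, hex(struct.unpack_from("<I", rom, 8)[0]))
```

Expect false positives on a real ROM: any 4 bytes that happen to match will be reported, so each hit should be checked in a debugger before it's rewritten.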
Within the blocks, once you're able to change the uncompressed size, you can probably simply move everything up or down to fit the new dialogue size, since the null terminator will implicitly tell it where the lines end (but beware that jump caveat I mentioned above; if it works for constant dialogue but breaks for things with choices or plot flag dependencies, that's probably why). With a little more effort figuring out the dialogue opcodes you'd even be able to add or remove screens' worth of dialogue if needed - which you may need to do anyway depending on how much the lengths change. Does it auto-split text that's too long for one window in a single opcode into multiple windows' worth?
I assume that offset is file-based - have you pursued finding out where it's loaded in memory? And is GBADecmp open-source?
Each actual dialog line starts with the bytes 0x08 0x03 and ends with 0x00 0x00, except for the first line of each file that doesn't have anything special to indicate the beginning.
That's a bit odd. Each file, when decompressed, starts with a Shift-JIS dialogue line that ends with 0x00 0x00 but doesn't start with an opcode? You're certain? In a typical system like this there -should- be the usual opcode there, plus an opcode before it to set portraits/etc. if needed (pretty much any plot dialogue). Perhaps it simply starts with a DIFFERENT opcode?
EDIT: Further findings:
Did some googling on the GBA's brand of LZ: apparently the first byte is a 0x10 signature, followed by 3 little-endian bytes for the uncompressed size (10EEh/4334d in this case), which is consistent with your file offset. There's another small block right before it at 17bcb0c too, which looks to have 'compressed' to larger than the actual data inside it, and I'm guessing the next blocks are 17BD14C (3A2Eh/14894d), 17BE5EC (ED2h/3794d) and 17BEAAC (229Ch/8860d), right? Do you know what's up with the "PSI3" that seems to end up coming a couple bytes later? Looks like more header but without digging out some source code I can't be sure yet. The byte in between might be a bitflag or something if the header is indeed more than 4 bytes. Anything else in the header COULD matter for resizing, or it might not... have to find out what it is to be sure.
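Reading that 4-byte header is simple enough to sketch in Python. This assumes the layout described above (0x10 signature, then a 24-bit little-endian uncompressed size) and nothing more; the function name is made up, and whatever follows the first 4 bytes is left uninterpreted.

```python
def read_lz_header(rom: bytes, offset: int) -> int:
    """Return the uncompressed size stored in a 0x10-type LZ header at offset."""
    if rom[offset] != 0x10:
        raise ValueError(f"no 0x10 signature at {offset:#x}")
    # Bytes 1..3 hold the uncompressed size, least significant byte first.
    return int.from_bytes(rom[offset + 1:offset + 4], "little")


if __name__ == "__main__":
    header = bytes([0x10, 0xEE, 0x10, 0x00])  # bytes that encode 10EEh
    print(read_lz_header(header, 0))
```

Fed the bytes 10 EE 10 00, this gives 4334, matching the 10EEh figure for the block at 17bcb2c.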
Either way, it seems clear: The only size that's stored is probably the uncompressed size, and it's right there in the header; looks like this is a variant of LZ with no end-marker in the stream too, and sliding-dictionary only. To get the compressed size to see if you're over available space or not, you're going to have to keep track of it as you (de)compress.
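Here's a sketch of what "keep track of it as you decompress" could look like, in Python for readability. It assumes the blocks follow the standard GBA BIOS type-0x10 layout: after the 4-byte header comes a flag byte whose 8 bits (most significant first) each say whether the next item is a literal byte (0) or a 2-byte back-reference (1) with a 4-bit length-minus-3 and a 12-bit distance-minus-1. I haven't verified that this game's blocks match, and the function name is mine. If they do match, the byte right after the header would just be the first flag byte, with literal bytes following it raw.

```python
def lz10_decompress(rom: bytes, offset: int = 0):
    """Decompress one type-0x10 block; return (data, compressed_bytes_consumed)."""
    if rom[offset] != 0x10:
        raise ValueError(f"no 0x10 signature at {offset:#x}")
    out_size = int.from_bytes(rom[offset + 1:offset + 4], "little")
    out = bytearray()
    pos = offset + 4
    while len(out) < out_size:
        flags = rom[pos]
        pos += 1
        for bit in range(7, -1, -1):          # flag bits are read MSB first
            if len(out) >= out_size:
                break
            if flags & (1 << bit):            # back-reference into earlier output
                b1, b2 = rom[pos], rom[pos + 1]
                pos += 2
                length = (b1 >> 4) + 3
                distance = (((b1 & 0x0F) << 8) | b2) + 1
                if distance > len(out):
                    raise ValueError("back-reference before start of output")
                for _ in range(length):
                    if len(out) >= out_size:
                        break
                    out.append(out[-distance])  # byte-by-byte so overlaps work
            else:                             # literal byte
                out.append(rom[pos])
                pos += 1
    # pos - offset counts header + stream, not any alignment padding after it.
    return bytes(out), pos - offset


if __name__ == "__main__":
    block = b"\x10\x08\x00\x00" + b"\x20" + b"ab" + b"\x30\x01" + b"\xff\xff"
    data, used = lz10_decompress(block)
    print(data, used)
```

The second return value is the number of compressed bytes actually read, which is the figure to compare against the space available before the next block.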
There any particular reason your site seems to have the files in no particular order? It looks like actually finding the dialogue for 17bcb2c on it would be a matter of looking through all the pages by hand...? How would I find it to see what I'm looking at?
Also, it's possible they take the first line's print opcode as implied and put the first portrait/etc after, maybe, though I have to wonder if there's something funky going on there in the case of plot dialogue that starts with character movement instead of a print. Stuff like the first PC-Murno scene would have to have opcodes in there for the relevant animations, presumably.
Using VBA's memory viewer, I was able to confirm that the 'first' block gets loaded at 0x097BCB2C in memory, which, if the above is correct, puts the next block at 0x097BD14C and the compressed size of the 2C block at probably 620h, including header; pretty good compression, but quite believable. It's also looking oddly like headers are always 16-byte (or possibly even 32-byte) aligned with a 12-byte offset, which may mean something (either they want the bit after the first 4 bytes paragraph-aligned, or there may be 12 or 28 more bytes of header before each block).
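The arithmetic behind those numbers, as a quick Python check. The block starts after the first one are my guesses from earlier in this post rather than confirmed values, and the "span" between two starts is only an upper bound on the compressed size, since it would include any padding or extra header bytes sitting between blocks.

```python
GBA_ROM_BASE = 0x08000000  # standard GBA cartridge ROM mapping

# First entry is the known file offset; the rest are guessed block starts.
BLOCK_STARTS = [0x17BCB2C, 0x17BD14C, 0x17BE5EC, 0x17BEAAC]


def to_address(file_offset: int) -> int:
    """Map a ROM file offset to the address seen in the memory viewer."""
    return GBA_ROM_BASE + file_offset


def spans(starts):
    """Distance from each block start to the next (upper bound on its size)."""
    return [b - a for a, b in zip(starts, starts[1:])]


if __name__ == "__main__":
    for start in BLOCK_STARTS:
        print(f"{start:#09x} -> {to_address(start):#010x}, start % 16 = {start % 16}")
    print([hex(s) for s in spans(BLOCK_STARTS)])
```

All four starts come out at 12 modulo 16 (and modulo 32), and the first span is 620h, which is where the alignment observation and the compressed-size estimate come from.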
Figuring the remaining details out may help with figuring out where they stashed the pointers to these things, since we need to know which exact byte they're going to be pointing TO.
A link to some good clean C source code that handles the relevant decompression would be really nice if you have anything like that, BTW.