GAH, gonna shoot myself - Text editing = confusing x.x

Discussion in 'NDS - ROM Hacking and Translations' started by CPhantom, May 31, 2008.

  1. CPhantom
    OP

    Member CPhantom The Noob :(

    Joined:
    May 14, 2008
    Messages:
    587
    Country:
    United States
    So, I've been trying to follow the tutorials, and I am just flat out lost o.o;;
    I'm just testing out how to do this, but I have no clue what I'm truly doing here.

    Is there a game I should start with for translating to make it easier on me? Or am I just going to need to understand the tutorials that are confusing me? o.o;;
    I am really only to the point where the file is ripped apart, so yeah >.>;;;

    newb needs help here.
     
  2. FAST6191

    Reporter FAST6191 Techromancer

    Joined:
    Nov 21, 2005
    Messages:
    21,731
    Country:
    United Kingdom
    Stuff like Sim City uses plain text files but if you look back I would suggest starting out with New Super Mario Brothers.
    The text is unicode, which any hex editor or text editor worth anything these days supports (if not you can just ignore the 00's between characters and use ASCII, which every editor I have ever used supports), and the pointers were simple enough to work with (I forget exactly what they were but they were nothing spectacular).

    Some simple info examples on NSMB:
    http://gbatemp.net/index.php?showtopic=32910&hl=
    http://gbatemp.net/index.php?showtopic=33129&hl=

    A more general explanation:
    Getting the text:
    GBA roms are all one big file so you can do a search for 08XXXXXX (X is a wildcard), and should you find a section with a lot of them in it you have something of interest (it could be music, graphics, video or something else, but text is a distinct possibility). 09 is used for sections in roms above the 16 megabyte line. A sidenote here is that there are also other mappings of the cart in memory, with 08 and 09 being the ones with the shortest delay/most urgent. See the GBATEK memory map for more on this.
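
    A rough sketch of that 08XXXXXX search in Python. The GBA is little-endian, so in a hex editor a pointer 08123456 shows up as the bytes 56 34 12 08. The function name and the tiny fake rom are made up for illustration, and the 4 byte alignment is only an assumption that holds for many pointer tables, not all of them.

```python
import struct

GBA_ROM_BASE = 0x08000000  # where the cart is mapped in GBA memory


def find_pointer_candidates(rom: bytes, align: int = 4):
    """Return (file_offset, target_offset) pairs for words that look like rom pointers."""
    hits = []
    for pos in range(0, len(rom) - 3, align):
        (word,) = struct.unpack_from("<I", rom, pos)  # little-endian 32 bit word
        if (word >> 24) in (0x08, 0x09):
            target = word - GBA_ROM_BASE
            if target < len(rom):  # only keep pointers that land inside the file
                hits.append((pos, target))
    return hits


# Tiny fake "rom": two pointers, some padding, then two strings.
rom = struct.pack("<II", 0x08000010, 0x08000016) + bytes(8) + b"HELLO\x00WORLD\x00"
candidates = find_pointer_candidates(rom)
for pos, target in candidates:
    print(f"pointer at {pos:#x} -> file offset {target:#x}")
```

    A cluster of hits close together is the "section with a lot of them" worth looking at; isolated hits are usually coincidence.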

    Memory tracing: not quite proper tracing and used more for palettes. The text probably finds a way into the ram at some point, you find the text in the ram and search back through the rom for it.
    Similar to this, shiftJIS (detailed below) often starts with an 8 or a 9, so it is worth a look if you have a Japanese game. Unicode for most roman characters will look like 00XX, so a search for 00 is not a bad idea.
    Tracing proper: you log all reads of the cart in an emulator (VBA-SDL and vbasdl-h for the GBA and the developer's version of no$gba for the DS are the most commonly used emulators for this, but you can kludge stuff together in other ways).
    Disassembly: related to tracing, but going through the game's binary by hand can give either the location of the text or the text itself, as it can be in there with the rest of the code.
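
    The 00 search mentioned above can be automated. A minimal sketch, assuming the text is stored little-endian (so each roman character is followed by its 00 byte, as on the DS); the sample bytes and the minimum run length of four characters are invented for the example.

```python
import re

# Four or more printable ASCII characters, each followed by a 00 byte (UTF-16LE).
UTF16_RUN = re.compile(rb"(?:[\x20-\x7e]\x00){4,}")


def find_utf16_strings(data: bytes):
    """Return (offset, text) for every run that looks like 16 bit roman text."""
    return [(m.start(), m.group().decode("utf-16-le")) for m in UTF16_RUN.finditer(data)]


# Junk bytes around two strings, standing in for a real file.
sample = (
    b"\xff\xfe\x12"
    + "World 1-1".encode("utf-16-le")
    + b"\x00\x00\x9a"
    + "Pause".encode("utf-16-le")
)
found = find_utf16_strings(sample)
for offset, text in found:
    print(f"{offset:#06x}: {text}")
```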

    DS specific: files have a nice name or extension within a file system. Often carries across from Japanese games as well.

    Corruption: mess up a sequence of code and see what happens in the game. Not very precise but it can narrow things down very quickly. Try to be intelligent if you use this: in the case of the DS do not spend 6 hours corrupting the sound file, and run the file through a tile editor first to make sure it is not a section of graphics you are working on.
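
    A small helper for the corruption approach, as a sketch only. It works on a copy of the data so the original file is never touched; the function name and the fixed seed (so a run can be repeated) are choices made for this example.

```python
import random


def corrupt(data: bytes, start: int, length: int, seed: int = 0) -> bytes:
    """Return a copy of data with the range [start, start + length) replaced by random bytes."""
    rng = random.Random(seed)
    end = min(start + length, len(data))
    junk = bytes(rng.randrange(256) for _ in range(end - start))
    return data[:start] + junk + data[end:]


original = bytes(range(64))  # stand-in for a file read with open(path, "rb").read()
patched = corrupt(original, start=16, length=8)
print(patched[12:28].hex(" "))
```

    Corrupting half of a suspect region, testing in game, then halving again narrows things down in a handful of runs.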

    Scanning: I mentioned tile editors above and they are quite good at giving a representation of a file at a glance (far more so than hex numbers). You can add in some statistical analysis as well (detailed below when encoding is talked about).

    First step after getting the text to begin with:
    Find the encoding.
    The three most common encodings for computers as far as rom hacking is concerned these days (go back in history and it will not be so important) are ASCII, Unicode and ShiftJIS. ASCII first:
    http://www.neurophys.wisc.edu/comp/docs/ascii/
    Notice there that only the lower 7 bits are used; when the uppermost bit is used it gets a bit more complicated (there are 128 more characters available and different companies, countries, consortiums, programmers and so on have made their own versions).

    Unicode:
    Unicode is an odd one. As usually stored in roms it is 16 bit, unlike ASCII, which means it is easy enough to accommodate near enough every character recognised by mankind. For anything after the initial "ASCII" section see the charts:
    http://unicode.org/charts/
    (There are also several implementations of each character set and any and all could have been used.)

    ShiftJIS when dealing with Japanese. It is not the most obvious of encodings (if you look at ASCII, 3? is a number (33 being 3, 35 being 5...), capitals are hex 20 lower than the lower case (41=A, 61=a) and it is all relative: 41=A, 42=B, 43=C) but it works well enough:
    http://www.rikai.com/library/kanjitables/kanji_codes.sjis.shtml
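
    The ASCII relationships and the "starts with an 8 or a 9" habit of ShiftJIS are easy to check with Python's built-in codecs. A small sketch; the sample word is just an example and real game text may use a custom table instead.

```python
# ASCII: digits sit at 0x30-0x39, capitals are 0x20 below lower case, letters are sequential.
digit_three = chr(0x33)
case_gap = ord("a") - ord("A")
letters = bytes([0x41, 0x42, 0x43]).decode("ascii")

# ShiftJIS: each kana/kanji takes two bytes, and the first byte of hiragana is 0x82.
raw = "こんにちは".encode("shift_jis")
lead_bytes = raw[0::2]
decoded = raw.decode("shift_jis")

print(digit_three, hex(case_gap), letters)
print(raw.hex(" "))
print(decoded)
```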

    Of course not all games use them, which means you may have to work out the encoding yourself, and here it can get very interesting. A note here: Capcom have their own table they use in some of their games, which is readable in a plain ASCII editor as the lower case characters are read as ASCII upper case characters.

    Earlier I mentioned relative sets of characters (41=A, 42=B) and you can abuse this by doing a relative search for a number sequence (simple example is CAB, which would have a pattern like x, x-2, x-1). This will spit out a list of results but it gives you something to work with. Longer inputs are better but can be tripped up by the increased "problems" detailed in the ---- enclosed section.
    Other things involve repeated sections or words (some games I have worked on have had a quirky character who repeats lines or has a catchphrase, kupo from the Final Fantasy titles being a good example). Note however that capitals and lower case characters are not likely to be relative and I have seen the two cases encoded one after the other (AaBbCc.....).
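
    A minimal relative search sketch. It compares the differences between neighbouring bytes with the differences between neighbouring letters of the search word, so it does not care what value A actually has. The stride parameter is an addition for this example: stride=2 skips every other byte, which is one way to handle 16 bit characters. Keep the search word all in one case for the reason given above.

```python
def relative_search(data: bytes, word: str, stride: int = 1):
    """Return offsets where the byte-to-byte differences match those of `word`."""
    if len(word) < 2:
        raise ValueError("need at least two characters to compare")
    diffs = [ord(b) - ord(a) for a, b in zip(word, word[1:])]
    span = (len(word) - 1) * stride
    hits = []
    for pos in range(len(data) - span):
        values = data[pos : pos + span + 1 : stride]
        if all(values[i + 1] - values[i] == d for i, d in enumerate(diffs)):
            hits.append(pos)
    return hits


# Made-up encoding where A=0x80: "CAB" is 82 80 81. The 05 03 04 run is a false positive.
data = bytes([0x10, 0x82, 0x80, 0x81, 0x00, 0x05, 0x03, 0x04])
hits_8bit = relative_search(data, "CAB")

# Same text stored as 16 bit values with a constant second byte.
data16 = bytes([0x82, 0x01, 0x80, 0x01, 0x81, 0x01])
hits_16bit = relative_search(data16, "CAB", stride=2)
print(hits_8bit, hits_16bit)
```

    The short word gives a false hit even in eight bytes, which is why longer inputs are better.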

    Brute force:
    Find the text by disassembly or corruption (corruption is simply changing a section and seeing what happens ingame) and then try characters (best to make it fairly methodical) and get it that way. Downside is that it can take a while, upside is that it is fairly foolproof.

    Statistics: so far I would hazard a guess that the space key was my most commonly typed character of all the ones available, so a breakdown of the characters used would give some good ideas as to what the space is encoded as, which then provides a base for you to guess the other characters.
    You can expand such a concept into plain linguistics as well:
    I saw it in a film (Zodiac I think it was) but it is a good example: a serial killer sent coded notes to newspapers with no key. The code was broken by the fact that the word kill (and so the repeated l character and then the k and i) was bound to be in there somewhere. The most simple example would be the word français (the French word for French): you would likely see fran?ais in an editor, from which you could then guess that whatever the ? is is supposed to represent a ç.
    Hackers hangman/crosswords can also be played once you have the space and some idea of what is what.
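
    A sketch of the character breakdown idea. The encoding used to build the sample (space as EF, a to z from 80 upwards) is invented purely so there is something to count; on a real script block you would feed in the raw bytes.

```python
from collections import Counter


def byte_frequencies(data: bytes, top: int = 5):
    """Return the most common byte values with their share of the data."""
    total = len(data)
    return [(value, count / total) for value, count in Counter(data).most_common(top)]


# Build a sample in a made-up encoding: space = 0xEF, a..z = 0x80 onwards.
text = "the cat sat on the mat and the rat ran"
sample = bytes(0xEF if c == " " else ord(c) - ord("a") + 0x80 for c in text)

ranking = byte_frequencies(sample)
for value, share in ranking:
    print(f"{value:02X}: {share:.1%}")
```

    The top entry is the likely space; the next few are candidates for the most common letters of whatever language the script is in.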

    Disassembly: the only surefire way but it requires knowledge of assembly for the console you are working with.

    Font viewing: The font is usually encoded as it appears in the table (order wise) so it can help narrow things. External encodings (outside the binary) are very very rare so I would not count upon them.

    ----------------------------------------------------------

    Problems:
    Japanese is not relative when it comes to kanji (stroke order, first to appear in the game/script and many more things can be used as well as the fact there are many kanji and not all will be used in a game). The problem is not limited to Japanese either.
    16 bit: most relative searches are for 8 bit characters. You can use a wildcard to get around this though. 12 bit and 24 bit character maps have been heard of as well but thankfully they are very very rare.

    MTE/DTE, multi tile and dual tile encoding. Strictly it is all multi tile encoding but dual tile encoding is the most common.
    Here a word or some characters are encoded as one or more characters.
    Sidenote: if you edit the font you can have a rather nice halfway hack to give the illusion of variable width fonts (Japanese characters are all approximately the same width, while ijl looks odd if spaced as much as, say, a w (try it in a codebox)).
    A cousin of this is the halfwidth characters for Japanese. I mentioned that Japanese characters are the same size for the most part, but they are also fairly wide, which means two characters each forming half of the character are used to make a whole one.
    Another cousin of this is the variable. Often you get the chance to name characters or there is a variable amount of gold in your bank/wallet and this needs calling upon. This can come in anything from programming/xml style "[main char]" arrangements to just another "character" in the table ("you need 4F gold to stay the night").

    Not all text sections use the same table, this can mean anything from a different extended set of ASCII to a whole different table. This is mainly limited to having different tables for the menu, ingame sections, cutscenes but multiple tables for a section of speech has been heard of on several occasions.

    Compression: long the bane of the romhacker but text is large, cumbersome and a good candidate for compression. The GBA/DS bios has decompression functions with arguably the most common being one of the LZ implementations but a game is not limited to those.
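
    For reference, a sketch of a decompressor for the most common of those BIOS LZ variants (the one whose data starts with the byte 10), written from the layout GBATEK gives for the LZ77 functions as I understand it. Treat it as a starting point and check it against known data before trusting it; the test input is hand built.

```python
def decompress_lz10(data: bytes) -> bytes:
    """Decompress a GBA/DS BIOS style LZ77 stream (type 0x10)."""
    if data[0] != 0x10:
        raise ValueError("not an LZ77 type 0x10 stream")
    size = int.from_bytes(data[1:4], "little")  # decompressed size, 24 bit
    out = bytearray()
    pos = 4
    while len(out) < size:
        flags = data[pos]
        pos += 1
        for bit in range(7, -1, -1):  # flag bits are read from the top bit down
            if len(out) >= size:
                break
            if flags & (1 << bit):
                # Back reference: 4 bit length, 12 bit distance.
                b1, b2 = data[pos], data[pos + 1]
                pos += 2
                length = (b1 >> 4) + 3
                distance = (((b1 & 0x0F) << 8) | b2) + 1
                for _ in range(length):
                    out.append(out[-distance])
            else:
                # Literal byte copied straight through.
                out.append(data[pos])
                pos += 1
    return bytes(out[:size])


# Header (10, size 8), one flag byte, a literal "A", then "copy 7 bytes from 1 back".
packed = bytes([0x10, 0x08, 0x00, 0x00, 0x40, 0x41, 0x40, 0x00])
unpacked = decompress_lz10(packed)
print(unpacked)
```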

    ---------------------------------------------

    You take the info on what the characters are and put it in a so called table file. There are a few different sorts and what you use will depend on what you are doing and/or like to use.
    Very few commercial editors and general purpose editors support tables (and even fewer support stuff like DTE/MTE and variables) but there are a whole bunch of hacking editors available at romhacking.net:
    http://www.romhacking.net/?category=13&Platform=&game=&author=&os=&level=&perpage=30&page=utilities&utilsearch=Go&title=&desc=
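
    To show what a table file does, here is a sketch that parses the simple "hex=text" style of table and uses it to decode some bytes, including a couple of DTE entries and a variable. The table contents and byte values are invented for the example; real table formats vary between tools.

```python
def parse_table(text: str) -> dict:
    """Parse lines of the form HEX=text into a {bytes: str} mapping."""
    table = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        table[bytes.fromhex(key.strip())] = value  # value kept as is, it may be a space
    return table


def decode(data: bytes, table: dict) -> str:
    """Decode data with the table, trying the longest keys first."""
    longest = max(len(key) for key in table)
    out = []
    pos = 0
    while pos < len(data):
        for size in range(longest, 0, -1):
            chunk = data[pos : pos + size]
            if chunk in table:
                out.append(table[chunk])
                pos += len(chunk)
                break
        else:
            out.append(f"<{data[pos]:02X}>")  # unknown byte, shown as raw hex
            pos += 1
    return "".join(out)


# Made-up table: plain letters, two DTE pairs (90, 91) and a variable (F0).
TABLE_TEXT = "\n".join(["00= ", "80=a", "81=b", "82=c", "90=th", "91=e ", "F0=[name]"])
table = parse_table(TABLE_TEXT)
script = decode(bytes([0x90, 0x91, 0x82, 0x80, 0x81, 0x00, 0xF0, 0x7F]), table)
print(script)
```

    Unknown bytes are left visible as hex rather than dropped, which makes it obvious which table entries are still missing.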

    From here you deal with pointers and reinsertion issues. Pointers are detailed in the rom hacking sticky (they are simply a list of locations at which lines and/or paragraphs/sections begin/end) and insertion issues are mainly things like only 70 characters per line/3 lines per paragraph.
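
    A sketch of why pointers need recalculating when text changes length. The layout here is hypothetical (a table of 32 bit little-endian offsets from the start of the file, followed by 00 terminated strings) and is not taken from any particular game; real games differ in pointer size, base address and terminators.

```python
import struct


def read_strings(blob: bytes, count: int):
    """Read `count` pointers from the start of blob and return the strings they point at."""
    pointers = struct.unpack_from(f"<{count}I", blob, 0)
    strings = []
    for ptr in pointers:
        end = blob.index(b"\x00", ptr)  # strings end at the first 00 byte
        strings.append(blob[ptr:end])
    return strings


def build_blob(strings):
    """Lay the strings out one after another and recalculate every pointer."""
    table_size = 4 * len(strings)
    pointers = []
    body = bytearray()
    for s in strings:
        pointers.append(table_size + len(body))
        body += s + b"\x00"
    return struct.pack(f"<{len(strings)}I", *pointers) + bytes(body)


original = build_blob([b"Start", b"Quit"])
strings = read_strings(original, 2)
strings[0] = b"Begin game"  # longer than before, so the second string has to move
rebuilt = build_blob(strings)

print(struct.unpack_from("<2I", original), struct.unpack_from("<2I", rebuilt))
```

    Overwriting the first string in place would have run into the second one; rebuilding moves it and updates its pointer.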

    Fonts are simply the graphical representations of characters and for the most part are simple. NFTR is a common DS format and a good example of a halfway house between the simple row of tiles approach and the more complex generated set of tiles. deufeufeu did a bunch of work on the latter type:
    http://deufeufeu.free.fr/wiki/index.php?title=Main_Page and CrystalTile2 supports NFTR if you are looking around.

    Edit: I have a horrible feeling I missed out some important info and/or a technique and I am fairly sure this will probably not help much for your situation beyond the first and last paragraphs (lots of info without saying all that much) but it is typed now I guess.
     
  3. CPhantom
    OP

    Hmm okay.

    I assume then that is it just the game style and company choice on how the text is stored?

    What about Sim games? I'm trying to learn this as there is a game coming out shortly that is based off an anime that will not be translated for an English speaking audience. It is going to be a sim game. The only other thing I can think of that it may be like is Doki Doki Majo Shinpan (without the little girl touching or perversion). So, I figured that working with Doki Doki might actually help out in the translation of this upcoming release. Do you think that is a good thing to do? Or is it too hard to translate these heavily text-based games at first?
     
  4. FAST6191

    From a hacking standpoint it is not usually any more difficult other than it being a bit more tedious to calculate the pointers and sort the text spacing issues. The actual translation is what takes the time but if you just want to add new text in saying whatever you want it to say then that is OK.
    I will point out that puzzle games and other low text stuff can use pictures instead.

    I would not limit myself to translating similar games as near enough any game that uses text will be good experience.
     
  5. CPhantom
    OP

    Okay, and is there a recommended way of doing this? I have Hex Workshop, but I'm not really sure how to use it or what to look for. Nor do I understand what files I'm opening xD

    I notice for Doki Doki Majo Shinpan there is a Talk folder, which I assume would have some of the text, yet I don't really see anything with Hex Workshop o.o;;

    Also, there is a Fonts folder, which might be what I should look for as well? I don't know what to do there x.x;;

    I'm definitely going to get the New Super Mario Brothers on my comp so I can get some knowledge on all this. I am so Hex stupid
     
