Hacking TempGBA: NDSGBA revival

Deleted member 319809 · May 23, 2013

masterz87 said:
That is probably going to make this emulator (when it's done) the most interesting GBA emulator I've ever seen, and probably one of the coolest emulators I've ever seen (outside of the LLEs out there that do some insane stuff)... but anyway, whenever you get that done, that's definitely going to be _very_ _very_ nice. Anyway, keep up the great work. I'm hoping that the cache isn't too architecture-specific; even if it is MIPS-based, that's fine since I have a GCW coming, and I'm hoping they get that added to their gpSP, since I'm hoping to combine that + my DS2 for the ultimate "only have to carry this" thing. With the PSP emulator making a nice amount of progress, I'm hoping they do a MIPS one so I can literally only have to carry one console with me.

a) I don't know what you know about the DSTwo and the GCW, and the passages I've highlighted don't help either. Do you know that the DSTwo's internal processor and the GCW's are both MIPS?
b) Here's the commit that introduced the recompiler changes in beta 15 (which was later reverted). None of them are MIPS-emitter-specific or MIPS-stub-specific, so they could apply to ARM or PC if someone wanted them (woo!). However, the statistics I collect and display about the code cache are DSTwo GUI specific.

masterz87 said:
But that's off topic. Thanks for the great work, keep it up; I love the progress you're making, since I honestly never thought that this'd be as good as it is right now, even with the CPU it has. Anyway, thanks for the info that it is ~6MB/s maximum read speed. I may end up using a slightly faster card (since I'm getting a bigger one anyway and it's cheaper to get a faster one), but anyway, thanks for the information.

P.S. Are you the guy who's also doing the SNES emulator for the DS2? If so, will that then try to use something similar? Or is this going to be a gpSP-only system? I remember reading that you said Exophase had an idea for it; I have no idea how well developed it was when he quit, but I was just wondering. Keep up the great work!

c) It was pretty on-topic, I find

d) Yes, I am.
e) I don't really want to add recompilers to Snes9x. First and foremost, because they're very, very complicated and not cross-platform at all, and the goal of Snes9x is to use portable code. And secondly, Snes9x does have a well-maintained project! I could take over gpSP and add whatever I like, because Exophase doesn't care about the code anymore. But if I tried to do that for Snes9x, I'd end up making a recompiler over really old code (Snes9x 1.43) that wouldn't apply to their code (Snes9x 1.54) anymore, so they could not use it even though both our projects are open source. Snes9x 1.54 cannot be used on the DSTwo because, in a twist of irony, its performance is two to three times worse than 1.43's (a recompiler would boost that, but it would only be boosting the performance of 1.54 back to 1.43 levels), and also because it contains large amounts of C++ code and the DSTwo SDK only supports C.

I hope this post was enlightening.

masterz87 · May 24, 2013

Nebuleon said:
a) I don't know what you know about the DSTwo and the GCW, and the passages I've highlighted don't help either. Do you know that the DSTwo's internal processor and the GCW's are both MIPS?
b) Here's the commit that introduced the recompiler changes in beta 15 (which was later reverted). None of them are MIPS-emitter-specific or MIPS-stub-specific, so they could apply to ARM or PC if someone wanted them (woo!). However, the statistics I collect and display about the code cache are DSTwo GUI specific.

c) It was pretty on-topic, I find
d) Yes, I am.
e) I don't really want to add recompilers to Snes9x. First and foremost, because they're very, very complicated and not cross-platform at all, and the goal of Snes9x is to use portable code. And secondly, Snes9x does have a well-maintained project! I could take over gpSP and add whatever I like, because Exophase doesn't care about the code anymore. But if I tried to do that for Snes9x, I'd end up making a recompiler over really old code (Snes9x 1.43) that wouldn't apply to their code (Snes9x 1.54) anymore, so they could not use it even though both our projects are open source. Snes9x 1.54 cannot be used on the DSTwo because, in a twist of irony, its performance is two to three times worse than 1.43's (a recompiler would boost that, but it would only be boosting the performance of 1.54 back to 1.43 levels), and also because it contains large amounts of C++ code and the DSTwo SDK only supports C.

I hope this post was enlightening.

First, it's interesting that you're using Snes9x; I've used it before. I only just realized that you were the one who did it, and I didn't even know the thing existed until a little while ago, so my knowledge on the subject is very low. About the CPU: from what I've read/heard, the DSTwo uses an Ingenic CPU (I don't remember where I saw that anymore...), but I do know that it's MIPS, and I could've sworn that wherever I saw that, it said it was the same type of CPU as in the Dingoo A320. The GCW is also a MIPS-based console. It uses an Ingenic CPU that's clocked at 1 GHz, which means it's essentially a supercharged Dingoo. Also, I was talking about the PSP emulator guys doing a MIPS port of their emulator, although I _seriously_ doubt you'd be able to get the performance down to 2-3x as slow as native even though they share the same instruction set.

Also, it's sad to hear that Snes9x has gotten so much slower... that really makes me sad. I figured that the DSTwo's SDK used C since it's such a simple language and thus is safe for use on almost everything. Anyway, it's cool that you've got it working on the DS2.

P.S. It was a thread on the Supercard forums with someone saying that it's essentially an Ingenic CPU similar to the Dingoo's processor. Obviously that cannot be taken as exact truth, but since it's a MIPS CPU and the clock speed is similar, I figured it was "close enough" to take for granted.

http://www.gcw-zero.com/specifications That's the GCW's site; it says it's a 1 GHz MIPS CPU, and it's also an Ingenic CPU.

Also, about the DS2 having an Ingenic CPU: the English SDK documentation contains the following statement: "... specifically, JZ and FPGA...". That's also what makes me believe that it's an Ingenic CPU, as all of theirs start with JZ (for whatever reason).

Edit: Also, I was saying that because they share similar CPUs. Obviously the GCW _doesn't_ contain an FPGA, and I'm unsure how much you're using the FPGA for your emulation; if you're using it for the compiler cache then obviously it'd be harder. I've already opened an issue on GitHub with the guy who ported the gpSP that was made for the Dingoo (also a MIPS Ingenic CPU), and I'm hoping that he can maybe help you too, since both of you are using a similar CPU and the same instruction set. I don't know what all he's done to the gpSP source code, but I was hoping that the two of you could work together and make it even awesomer.

Also, here's the repo of his gpSP: https://github.com/pcercuei/gpSP

It's made to run on Linux for MIPS, and thus it's using an actual OS instead of just the firmware of the DS2, but I was/am hoping he can help you, and that you can both merge your gpSPs into an awesome project and keep it going so that gpSP gets a ton of awesome new features. All he's done (as far as I know) is make it work on the Dingoo/GCW (the GCW is pretty much just a newer Dingoo, since they both run Linux and use similar hardware; the GCW's is just newer and more powerful).

Edit 2: I just saw that the FPGA cannot (really) be altered, which is no surprise, so I imagine you guys are both reusing a lot of the same code (outside of your statistics). Anyway, keep up the awesome work. I saw gpSP way after the exodus of Exophase, so it's great to see that it's getting updated and is getting new features.

As far as it being on the PC, I don't know how useful that'd be for most people, since there are quite a few other GBA emulators there, and considering how much power everyone has, I don't see a lot of people wanting it. Now ARM, that's one I can definitely see, since most GBA emulators I've tried for Android use ~75%+ of the CPU on my phone (admittedly there is overhead from the Dalvik VM etc., but still).

Deleted member 319809 · May 24, 2013

masterz87 said:
Edit: Also, I was saying that because they share similar CPUs. Obviously the GCW _doesn't_ contain an FPGA, and I'm unsure how much you're using the FPGA for your emulation; if you're using it for the compiler cache then obviously it'd be harder. I've already opened an issue on GitHub with the guy who ported the gpSP that was made for the Dingoo (also a MIPS Ingenic CPU), and I'm hoping that he can maybe help you too, since both of you are using a similar CPU and the same instruction set. I don't know what all he's done to the gpSP source code, but I was hoping that the two of you could work together and make it even awesomer.

Also, here's the repo of his gpSP: https://github.com/pcercuei/gpSP

It's made to run on Linux for MIPS, and thus it's using an actual OS instead of just the firmware of the DS2, but I was/am hoping he can help you, and that you can both merge your gpSPs into an awesome project and keep it going so that gpSP gets a ton of awesome new features. All he's done (as far as I know) is make it work on the Dingoo/GCW (the GCW is pretty much just a newer Dingoo, since they both run Linux and use similar hardware; the GCW's is just newer and more powerful).

It's interesting that you put forward a link to pcercuei's gpSP, because he helped me debug some things already - and I do intend to help his repository out however I can when I'm able to test things.

Regarding the FPGA, I don't use it at all.

Deleted member 319809 · May 25, 2013

I asked for something that would help me debug the 00000000 return address problem, and received this.

A Supercard with exposed chips, but more importantly, serial port wires and a USB dongle at the end. It outputs at 57600 or 115200 baud via the JZ4740's UART and its input and output are accessible to Linux at /dev/ttyUSB<n>. I'm free to implement memory dumpers and function loggers over that bus; I just need to define my protocols. The serial port may also survive some of the "crashes" of slower-rendering games at 0 frameskip and allow me to see what's up a bit more.
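
For anyone wanting to listen on a line like this from the Linux side, here is a minimal sketch of opening the USB serial adapter in raw 8N1 mode at one of the two baud rates mentioned above. The function names are made up for illustration, the device path is whatever the caller passes (for example /dev/ttyUSB0), and none of this is code from TempGBA or the SDK.

```c
#include <fcntl.h>
#include <termios.h>
#include <unistd.h>

/* Map a numeric baud rate to a termios speed constant.
 * Only the two rates mentioned in the post are accepted.
 * Returns 0 on success, -1 for any other rate. */
int baud_to_speed(long baud, speed_t *out)
{
    switch (baud) {
    case 57600:  *out = B57600;  return 0;
    case 115200: *out = B115200; return 0;
    default:     return -1;
    }
}

/* Open a serial device in raw 8N1 mode (no echo, no line editing).
 * Returns a file descriptor, or -1 on any failure. */
int open_serial(const char *path, long baud)
{
    speed_t speed;
    struct termios tio;
    int fd;

    if (baud_to_speed(baud, &speed) != 0)
        return -1;
    fd = open(path, O_RDWR | O_NOCTTY);
    if (fd < 0)
        return -1;
    if (tcgetattr(fd, &tio) != 0) {
        close(fd);
        return -1;
    }
    /* Raw mode: disable input/output translation and line discipline. */
    tio.c_iflag &= ~(IGNBRK | BRKINT | PARMRK | ISTRIP | INLCR | IGNCR | ICRNL | IXON);
    tio.c_oflag &= ~OPOST;
    tio.c_lflag &= ~(ECHO | ECHONL | ICANON | ISIG | IEXTEN);
    /* 8 data bits, no parity, 1 stop bit; ignore modem control lines. */
    tio.c_cflag &= ~(CSIZE | PARENB | CSTOPB);
    tio.c_cflag |= CS8 | CLOCAL | CREAD;
    cfsetispeed(&tio, speed);
    cfsetospeed(&tio, speed);
    if (tcsetattr(fd, TCSANOW, &tio) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A caller would then read() from the returned descriptor and parse whatever logging or memory-dump protocol ends up being defined on the DSTwo side.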

Deleted member 319809 · May 25, 2013

As I suspected, the crashes at frameskip 0 are not caused by the code in gpSP. The program does not receive an exception at that point, and the serial port stops receiving audio lag notices, so the emulation does not continue while the audio/video communication link is broken. The most likely culprit is a wait-loop in the DSTwo-side code becoming infinite for some reason.

Fixing this would require delving into the protocols used in the DSTwo-to-DS communication link.

Boriar · May 25, 2013

Nebuleon said:
I asked for something that would help me debug the 00000000 return address problem, and received this.

A Supercard with exposed chips, but more importantly, serial port wires and a USB dongle at the end. It outputs at 57600 or 115200 baud via the JZ4740's UART and its input and output are accessible to Linux at /dev/ttyUSB<n>. I'm free to implement memory dumpers and function loggers over that bus; I just need to define my protocols. The serial port may also survive some of the "crashes" of slower-rendering games at 0 frameskip and allow me to see what's up a bit more.

Was it the Supercard team who sent it to you?

Deleted member 319809 · May 25, 2013

Boriar said:
Was it the Supercard team who sent it to you?

Yes

I sent an email to them and they arranged for a Supercard to be sent to me with a serial line. It took literally 4 days to arrive.

Boriar · May 25, 2013

I suppose that it doesn't let you modify the FPGA, does it?
And about Linux, can some of the information it provides be useful for DS2linux?

Deleted member 319809 · May 25, 2013

Boriar said:
a) I suppose that it doesn't let you modify the FPGA, does it?
b) And about Linux, can some of the information it provides be useful for DS2linux?

a) That is correct. This debug Supercard has access to exactly the same things as a regular one. A regular Supercard also has a UART for serial line output, but no wires come from the pins at all, which is the only difference.
b) During the Supercard's initialisation sequence, only the very unhelpful characters "BC" are ever output, at 57600 baud. That won't be useful.

masterz87 · May 26, 2013

Nebuleon said:
I asked for something that would help me debug the 00000000 return address problem, and received this.

A Supercard with exposed chips, but more importantly, serial port wires and a USB dongle at the end. It outputs at 57600 or 115200 baud via the JZ4740's UART and its input and output are accessible to Linux at /dev/ttyUSB<n>. I'm free to implement memory dumpers and function loggers over that bus; I just need to define my protocols. The serial port may also survive some of the "crashes" of slower-rendering games at 0 frameskip and allow me to see what's up a bit more.

It being a JZ4740, which is similar to the Dingoo's CPU, is really nice. I'm glad to hear that. Also, it's cool that the Supercard team sent you a card with the wires all connected and such; very, very awesome of them. They seem like a really good company (to me at least), or at the very least very committed to keeping the thing going.

Deleted member 319809 · May 26, 2013

Nebuleon said:
As I suspected, the crashes at frameskip 0 are not caused by the code in gpSP. The program does not receive an exception at that point, and the serial port stops receiving audio lag notices, so the emulation does not continue while the audio/video communication link is broken. The most likely culprit is a wait-loop in the DSTwo-side code becoming infinite for some reason.

Fixing this would require delving into the protocols used in the DSTwo-to-DS communication link.

FUCK YOU, SUPERCARD SDK.

Code:

// Wait for at least one buffer to be free for audio.
// Output assertion: The return value is between 0, inclusive,
// and AUDIO_BUFFER_COUNT, inclusive, and is lower than
// AUDIO_BUFFER_COUNT if one buffer is free.
while (ds2_checkAudiobuff() >= AUDIO_BUFFER_COUNT);

Right? Wrong.

Code:

// Wait for at least one buffer to be free for audio.
// Output assertion: The return value is between 0, inclusive,
// and AUDIO_BUFFER_COUNT, inclusive, but can also be
// 4294967295 -- that's (unsigned int) -1.
unsigned int n2;
while ((n2 = ds2_checkAudiobuff()) >= AUDIO_BUFFER_COUNT && (int) n2 >= 0);

I isolated this to the frameskip code in sound.c via the serial line.
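
The difference between the two loops above comes down to a single predicate, which can be sketched in isolation. The value given to AUDIO_BUFFER_COUNT here and the helper name are placeholders for illustration only; the real SDK call is not used.

```c
/* Placeholder value for illustration; the real constant comes from the
 * emulator's own headers. */
#define AUDIO_BUFFER_COUNT 4u

/* Given the value returned by the SDK's audio buffer check, decide whether
 * the caller should keep waiting for a buffer to become free.
 * Returns 1 to keep waiting, 0 to stop. */
int must_keep_waiting(unsigned int in_use)
{
    /* The check can return (unsigned int) -1. As an unsigned number that is
     * larger than any buffer count, so a plain ">=" test would wait forever.
     * Mirror the fixed loop: anything negative when viewed as an int ends
     * the wait. */
    if ((int) in_use < 0)
        return 0;
    return in_use >= AUDIO_BUFFER_COUNT;
}
```

With this helper, the fixed loop reads as `while (must_keep_waiting(ds2_checkAudiobuff()));`, and the unsigned -1 case exits immediately instead of hanging.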

Code:

[  596.820181] I: Decreasing automatic frameskip: 8..7
<insert freeze here>

masterz87 · May 26, 2013

About the bus speed limitations: I don't know if you have the ability to run code on the DS's CPUs that wait on the framebuffer/audio buffer. If you can, I'd look into this compression algorithm:

https://code.google.com/p/lz4/

You could compress the buffers using the DS2's CPU, then send them to the DS's CPU, decompress them, and finally display them. I'm sure that, even with the added latency of compressing/decompressing, it should help you get closer to 60 FPS (for those that want it). For me, a consistent 30 FPS is perfect and smooth; 60 is great, but honestly 30 is the ideal framerate, for me at least.

It's insanely fast, and also compresses decently. It could help you get around that 6MB/s limit that the DS has. I don't know if you could do what I said, but it's something to look into. Also, something that I'd like added to gpSP is the ability for the "snapshots" or quicksaves or whatever to have a screenshot of the screen + the save file in a zip file or an LZ4-compressed archive (zip would be easier, obviously). I know that the save files are small (as in insanely so), but changing the save format to use LZ4-compressed save files would also be a semi-nice thing to have, even though it's not a huge thing. Anyway, keep up the great work.

Deleted member 319809 · May 26, 2013

masterz87 said:
a) About the bus speed limitations: I don't know if you have the ability to run code on the DS's CPUs that wait on the framebuffer/audio buffer. If you can, I'd look into this compression algorithm:

https://code.google.com/p/lz4/

You could compress the buffers using the DS2's CPU, then send them to the DS's CPU, decompress them, and finally display them. I'm sure that, even with the added latency of compressing/decompressing, it should help you get closer to 60 FPS (for those that want it). For me, a consistent 30 FPS is perfect and smooth; 60 is great, but honestly 30 is the ideal framerate, for me at least.

It's insanely fast, and also compresses decently. It could help you get around that 6MB/s limit that the DS has. I don't know if you could do what I said, but it's something to look into.

b) Also, something that I'd like added to gpSP is the ability for the "snapshots" or quicksaves or whatever to have a screenshot of the screen + the save file in a zip file or an LZ4-compressed archive (zip would be easier, obviously). I know that the save files are small (as in insanely so), but changing the save format to use LZ4-compressed save files would also be a semi-nice thing to have, even though it's not a huge thing. Anyway, keep up the great work.

a) The code on the DS side expects a certain protocol, and its code is not readily editable because it only compiles on an ancient version of libnds/devkitARM that is not available for download anymore. I cannot change the protocol on only one side and have it work.

Even if the code were readily editable, LZ4 is a bit overkill for this application. Deflate (zip, gzip, etc.) would do pretty much the same thing, be faster for compression, and already be in the code on the MIPS side (TempGBA has .zip support) and maybe on the DS side, so it would not incur any more RAM usage on the DSTwo.

Deflate has good compression, and low CPU overhead at lower compression levels (ultra-fast, fast, etc.), and the FPS is 42 or so while it should be 60. If Deflate can shrink most images by about 30% (each frame would need to be at most 42/60 = 70% of its current size), then 60 FPS can be achieved.
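
The arithmetic behind that estimate can be sketched as follows. It assumes the frame rate is limited purely by the link, so that FPS is inversely proportional to the bytes sent per frame, and it ignores the time spent compressing and decompressing. The function name is made up for illustration. Going from 42 to 60 FPS works out to a reduction of 30%.

```c
/* Fraction by which each frame must shrink so that a link-bound frame rate
 * rises from current_fps to target_fps. Assumes FPS is inversely
 * proportional to bytes per frame and that (de)compression is free. */
double required_size_reduction(double current_fps, double target_fps)
{
    if (current_fps >= target_fps)
        return 0.0;   /* already fast enough, nothing to shrink */
    return 1.0 - current_fps / target_fps;
}
```

In practice the compressor's own CPU time eats into the gain, so the real requirement would be somewhat higher than this lower bound.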

Alternatively, techniques like run-length encoding, such as used in PCX, could help many images become smaller. The compression would be very fast, running in linear time, and be easy to implement in < 1024 bytes of code on both sides.
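
To give an idea of how small such a scheme can be, here is a minimal run-length sketch for 16-bit pixels. It is not the PCX format and not code from TempGBA; the (count, low byte, high byte) layout and the function names are assumptions chosen for simplicity.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode n 16-bit pixels as triples: run length (1..255), pixel low byte,
 * pixel high byte. Returns bytes written, or 0 if out_cap is too small. */
size_t rle16_encode(const uint16_t *in, size_t n, uint8_t *out, size_t out_cap)
{
    size_t i = 0, o = 0;
    while (i < n) {
        uint16_t px = in[i];
        size_t run = 1;
        /* Extend the run while the next pixel matches, up to 255. */
        while (i + run < n && in[i + run] == px && run < 255)
            run++;
        if (o + 3 > out_cap)
            return 0;
        out[o++] = (uint8_t) run;
        out[o++] = (uint8_t) (px & 0xFF);
        out[o++] = (uint8_t) (px >> 8);
        i += run;
    }
    return o;
}

/* Decode triples back into pixels. Returns pixels written, or 0 if the
 * input is malformed or the output buffer is too small. */
size_t rle16_decode(const uint8_t *in, size_t n, uint16_t *out, size_t out_cap)
{
    size_t i = 0, o = 0;
    while (i + 3 <= n) {
        size_t run = in[i];
        uint16_t px = (uint16_t) (in[i + 1] | (in[i + 2] << 8));
        if (run == 0 || o + run > out_cap)
            return 0;
        while (run--)
            out[o++] = px;
        i += 3;
    }
    return (i == n) ? o : 0;
}
```

Both directions are a single linear pass. The catch is the worst case: an image with no repeated neighbouring pixels grows by 50% (3 bytes per 2-byte pixel), so a real protocol would want a fallback to raw frames.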

b) I refer you to this excerpt:

readme.txt in gpSP 0.9 by Exophase (emphasis mine) said:
Q) Savestates? From other emulators??

A) See the savestates option in main menu. gpSP will probably never
support savestates from other emulators, there's just too much in the
way of emulator specific data in them.

Savestates are currently 506,943 bytes. They would be a little smaller
without the snapshot, but I find that very useful and it wouldn't help
size immensely. Compression would help, but I wanted the size to be
constant so you knew exactly how much you could hold and to improve
save/load speed.

This may have been unwarranted, but is now more useful.

See, gpSP 0.9 by Exophase was made 6 years ago, when memory cards were more expensive and even lower capacity than now. Users would have wanted the most saved states to fit on their cards at once, and .zip compressed ROMs, so I don't think that excerpt even made sense at the time - it would have gone against the wishes of users.

Now, though, with our large-capacity cards, we can fit more saved states and games on a card before it fills up. It is indeed sometimes faster to load uncompressed data than compressed data: compression only wins if the CPU cost of decompressing the data is made up by loading less data from the card, and the CPU cost of compressing the data is made up by writing less data to the card. Because that's card-speed-dependent, and because most people buy class 4 cards, it makes sense to keep the uncompressed loader to simplify the code.
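
That break-even condition can be written down as a rough model. This is only a sketch: the function name is made up, the model ignores seek and filesystem overhead, and any numbers fed into it (card speed, compressed size, decompression time) would have to be measured on real hardware.

```c
/* Rough model: loading a compressed saved state is faster only if the time
 * saved by reading fewer bytes from the card exceeds the CPU time spent
 * decompressing. Returns 1 if the compressed load is estimated to win. */
int compressed_load_is_faster(double raw_bytes, double packed_bytes,
                              double card_bytes_per_sec,
                              double decompress_secs)
{
    double raw_time = raw_bytes / card_bytes_per_sec;
    double packed_time = packed_bytes / card_bytes_per_sec + decompress_secs;
    return packed_time < raw_time;
}
```

For example, with a hypothetical 500,000-byte state that packs to 250,000 bytes on a card reading 4,000,000 bytes per second, compression wins if decompression takes 0.02 s but loses if it takes 0.1 s. The same reasoning applies to saving, with compression time in place of decompression time.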

Reading the screenshot from a compressed save state would be really slow, and that's re-read every time you switch saved states in the menu to preview them before loading or overwriting them. So if, hypothetically, I were to implement compressed saved states, there would be at least 115,200 bytes left uncompressed.

masterz87 · May 27, 2013

Nebuleon said:
a) The code on the DS side expects a certain protocol, and its code is not readily editable because it only compiles on an ancient version of libnds/devkitARM that is not available for download anymore. I cannot change the protocol on only one side and have it work.

Even if the code were readily editable, LZ4 is a bit overkill for this application. Deflate (zip, gzip, etc.) would do pretty much the same thing, be faster for compression, and already be in the code on the MIPS side (TempGBA has .zip support) and maybe on the DS side, so it would not incur any more RAM usage on the DSTwo.

Deflate has good compression, and low CPU overhead at lower compression levels (ultra-fast, fast, etc.), and the FPS is 42 or so while it should be 60. If Deflate can shrink most images by about 30% (each frame would need to be at most 42/60 = 70% of its current size), then 60 FPS can be achieved.

Alternatively, techniques like run-length encoding, such as used in PCX, could help many images become smaller. The compression would be very fast, running in linear time, and be easy to implement in < 1024 bytes of code on both sides.
b) I refer you to this excerpt:
This may have been unwarranted, but is now more useful.

See, gpSP 0.9 by Exophase was made 6 years ago, when memory cards were more expensive and even lower capacity than now. Users would have wanted the most saved states to fit on their cards at once, and .zip compressed ROMs, so I don't think that excerpt even made sense at the time - it would have gone against the wishes of users.

Now, though, with our large-capacity cards, we can fit more saved states and games on a card before it fills up. It is indeed sometimes faster to load uncompressed data than compressed data: compression only wins if the CPU cost of decompressing the data is made up by loading less data from the card, and the CPU cost of compressing the data is made up by writing less data to the card. Because that's card-speed-dependent, and because most people buy class 4 cards, it makes sense to keep the uncompressed loader to simplify the code.

Reading the screenshot from a compressed save state would be really slow, and that's re-read every time you switch saved states in the menu to preview them before loading or overwriting them. So if, hypothetically, I were to implement compressed saved states, there would be at least 115,200 bytes left uncompressed.

LZ4 uses way too much CPU? I don't know what you've been reading, or what algorithm you think I'm using, but LZ4 is at least 10x as fast as Deflate and uses way less CPU time when compressing and also decompressing. I don't know what you were looking at when you were looking into LZ4, but I cannot believe that Deflate is somehow faster. I've never once seen anything showing Deflate in fast/ultra-fast mode being faster than even LZO/QuickLZ, let alone LZ4. That's sad to hear that the protocol/how it communicates is so old/out of date. But still, Deflate is way, way slower than LZ4; I cannot remotely imagine that you've found it to be faster on ARM at all. I'm unsure of where you saw that it was faster to just use Deflate.

Next up, about the compressed save states: I guess if it's too slow, that's perfectly fine. I just thought it'd be an interesting thing to have, since it'd help differentiate them more. I know of only a few emulators that even implement it, and it usually helps me to realize where I was when I look at them.

Edit: If you're looking at the c. speed and you see that the LZ4 one is _larger_, this is a benchmark, and it scores them based on speed. So the higher the number, the more times something can be compressed/decompressed in some unit of time. I'm unsure how exactly it calculates it. But I can tell you from my own experience with LZ4 vs. Deflate that LZ4 at _worst_ is ~5-10x faster than Deflate.

I didn't realize that someone could read that benchmark as "higher == worse". I'll send a message to the project maintainer and tell him to make it clear that the higher the number, the better, because it's not clear, and I never realized that someone could see it that way.

To reiterate once again: on that benchmark, higher is better. The higher the value, the faster it is, and the more times it can compress the file per given period of time (unsure of the exact amounts, as I've not dug into the benchmark myself).

Deleted member 319809 · May 27, 2013

masterz87 said:
LZ4 uses way too much CPU? I don't know what you've been reading, or what algorithm you think I'm using, but LZ4 is at least 10x as fast as Deflate and uses way less CPU time when compressing and also decompressing. I don't know what you were looking at when you were looking into LZ4, but I cannot believe that Deflate is somehow faster. I've never once seen anything showing Deflate in fast/ultra-fast mode being faster than even LZO/QuickLZ, let alone LZ4. That's sad to hear that the protocol/how it communicates is so old/out of date. But still, Deflate is way, way slower than LZ4; I cannot remotely imagine that you've found it to be faster on ARM at all. I'm unsure of where you saw that it was faster to just use Deflate.

http://code.google.com/p/lz4/source/browse/trunk/lz4.c

Code:

// Increasing memory usage improves compression ratio
// Reduced memory usage can improve speed, due to cache effect
// Default value is 14, for 16KB, which nicely fits into Intel x86 L1 cache
#define MEMORY_USAGE 14

To be effective, the compression needs to use more memory. In this comment, one can see that the choice of compression buffer size greatly influences the speed. The data cache on MIPS is 8 KiB, and there's no L2 like on Intel.

Code:

// Unaligned memory access is automatically enabled for "common" CPU, such as x86.
// For others CPU, the compiler will be more cautious, and insert extra code to ensure aligned access is respected
// If you know your target CPU supports unaligned memory access, you want to force this option manually to improve performance
#if defined(__ARM_FEATURE_UNALIGNED)
#  define LZ4_FORCE_UNALIGNED_ACCESS 1
#endif

MIPS cannot do unaligned access. As the algorithm requires writing 16-bit and 32-bit values to memory at arbitrary addresses, this will slow down on MIPS.
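
To illustrate the cost: a plain word load or store on MIPS requires an aligned address, so portable code that touches multi-byte values at arbitrary addresses typically assembles them one byte at a time. The helpers below are a sketch of that general technique, with names made up for illustration; they are not taken from LZ4.

```c
#include <stdint.h>

/* Read a 32-bit little-endian value from any address, aligned or not,
 * using four byte loads plus shifts instead of one word load. */
uint32_t load_u32_le(const uint8_t *p)
{
    return (uint32_t) p[0]
         | ((uint32_t) p[1] << 8)
         | ((uint32_t) p[2] << 16)
         | ((uint32_t) p[3] << 24);
}

/* Write a 16-bit little-endian value to any address using two byte stores. */
void store_u16_le(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t) (v & 0xFF);
    p[1] = (uint8_t) (v >> 8);
}
```

Each such access costs several instructions where x86 needs one, which is the slowdown being described.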

Code:

if unlikely(forwardIp > mflimit) { goto _last_literals; }
[...]while likely(ip<matchlimit-(STEPSIZE-1))

Branch Likely and Branch Unlikely are great on X86 because of the branch predictor gates, but deprecated on MIPS. Compiler writers are advised never to emit those instructions. This will slow down on MIPS.

Code:

// Define this parameter if your target system or compiler does not support hardware bit count
#if defined(_MSC_VER) && defined(_WIN32_WCE)            // Visual Studio for Windows CE does not support Hardware bit count
#  define LZ4_FORCE_SW_BITCOUNT
#endif

The '1'-bit count instruction on X86, POPCNT, is great. However, it doesn't exist on MIPS. The only MIPS instructions that could be of value are in MIPS32 r2 (the DSTwo is a MIPS32 r1), and they're CLO and CLZ (Count Leading Ones and Count Leading Zeroes). That's not very helpful, because the algorithm wants to see how many 1s are left.

All of this would make LZ4 slower on MIPS than on Intel, therefore the speed of LZ4 may be equal to, or worse than, Deflate on MIPS.

masterz87 · May 27, 2013

Ah well, that's sad to hear. I knew that MIPS/other RISC processors have way less complexity in the way the processors work. Have you looked at LZO then? Or QuickLZ? I'm unsure how much they specifically target x86. I didn't think that LZ4 was that bad, since I know a developer used LZ4 in their game and found it to be faster than Deflate on the PSP, so I figured that it would be faster here too.

Here's a link to his blog post talking about the game company using LZ4 in their application:

http://fastcompression.blogspot.com/2011/10/lz4-in-commercial-application.html

The reason I linked to it here is that I knew the PSP uses a MIPS CPU (unsure of the exact revision of the instruction set, but I know it's MIPS), and thus the performance should be similar for you. Also, I just looked at the blog post again, and apparently they're using it for decompression only, which might explain why it's not as good, which is very sad to hear. I don't know if you've looked at LZO/QuickLZ, but I know that they're out there and that they are also much faster than Deflate.

Edit: Looking over the LZO source code, they at least have defines in the source code for MIPS/ARM, so maybe that'd be worthwhile for compressing/storing the screenshots with the other data in the snapshots.

http://www.oberhumer.com/opensource/lzo/#download

miniLZO is the one I was looking at, since LZO1X is the one that most people use when doing benchmarks against other algorithms. I don't know if you've already tried it out. I'd like to run some tests with it, but the only MIPS platforms I have lying around are my PSP and the DS2, and I'm not so comfortable writing code/benchmarks on those. When the GCW comes, I'll try some benchmarks on it, since I can find my way around Linux.

I hadn't read too much into the LZ4 source code myself, since everything I've been doing with it has been on x86. Sad to see that it's so x86-specific; here's hoping the guy can optimize it for other architectures so it can be used elsewhere.

Edit 2: I'll send him this way, so you can tell him about these things. Or I might just open a bug report about it with your information, if that's OK.

Also, you may want to talk with him on the Google Code page about the issues you raised here, as I'm not that informed on the minutiae of avoiding cache misses/keeping the data inside the cache as much as possible.

Cyan4973 · May 27, 2013

I would like to give some details on a few claims on LZ4 source code.

> To be effective, the compression needs to use more memory.
> In this comment, one can see that the choice of compression buffer size greatly influences the speed. The data cache on MIPS is 8 KiB, and there's no L2 like on Intel.

That's correct, and that's also why the memory parameter can be modified.
For a CPU with a data cache of 8KB, I would recommend using only 4KB for the hash table, hence the parameter:
#define MEMORY_USAGE 12

Note that cache effect is obviously not specific to LZ4, any algorithm has to face it.
At least here, the opportunity exists to clearly tune memory requirement to cache setup.
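
Put differently, the parameter is a power-of-two exponent: per the source comment quoted above, 14 corresponds to 16KB, so 12 corresponds to 4KB, half of an 8KB data cache. A one-line sketch of that relationship (the helper name is made up for illustration):

```c
#include <stddef.h>

/* Hash table size in bytes for a given MEMORY_USAGE exponent: 2^N bytes. */
size_t hash_table_bytes(unsigned memory_usage)
{
    return (size_t) 1 << memory_usage;
}
```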

> MIPS cannot do unaligned access. As the algorithm requires writing 16-bit and 32-bit values to memory at arbitrary addresses, this will slow down on MIPS.

That's correct.
Here too, the slowdown applies to every other algorithm. There is no reason why LZ4 should be more affected than other algorithms.
Note that LZ4 has been specifically tested and validated on align-only CPU, and has been reported to work well.

> Branch Likely and Branch Unlikely are great on X86 because of the branch predictor gates, but deprecated on MIPS.

Branch likely and unlikely hints have a very small impact on performance, typically within 1% each, so they should not pose any real issue.
It's also easy to disable them. That's already done automatically for MSVC compilers. Just extend the define to apply all the time should you want to remove them.
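
For reference, in GCC-style code likely() and unlikely() are typically macros around __builtin_expect, which is a hint to the compiler about code layout and not a request for a particular branch instruction. The sketch below shows that common pattern with an off switch; the NO_BRANCH_HINTS name is hypothetical, and the exact conditions used in lz4.c may differ.

```c
/* Hypothetical switch: define NO_BRANCH_HINTS to compile the hints away. */
#if defined(__GNUC__) && !defined(NO_BRANCH_HINTS)
#  define expect(expr, value) (__builtin_expect((expr), (value)))
#else
#  define expect(expr, value) (expr)
#endif
#define likely(expr)   expect((expr) != 0, 1)
#define unlikely(expr) expect((expr) != 0, 0)

/* Example use: the hint may change code layout, never the result. */
int clamp_to_limit(int value, int limit)
{
    if (unlikely(value > limit))
        return limit;
    return value;
}
```

Building once with and once without the switch, and timing both, is a cheap way to check how much the hints matter on a given target.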

> The '1'-bit count instruction on X86, POPCNT, is great. However, it doesn't exist on MIPS.

That's correct.
It's also why there is a custom software bit count method to replace it.
As far as I can tell, it works pretty well. In tests, it proved faster than the easier "compare & branches" method.
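
As an illustration of what a software replacement can look like, here is a sketch of the well-known de Bruijn multiply-and-lookup technique for counting trailing zero bits, and how it tells how many bytes of two words match. This is not copied from lz4.c; the function names are made up, and the byte-count helper assumes both words were loaded in little-endian order.

```c
#include <stdint.h>

/* Count trailing zero bits of a nonzero 32-bit value using a de Bruijn
 * multiply and a 32-entry table, with no special CPU instruction. */
unsigned ctz32_debruijn(uint32_t v)
{
    static const unsigned char table[32] = {
        0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
        31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
    };
    /* v & -v isolates the lowest set bit; the multiply moves a unique
     * 5-bit pattern into the top bits, which indexes the table. */
    return table[((v & (0u - v)) * 0x077CB531u) >> 27];
}

/* Number of leading bytes (0..4) that are equal between two 4-byte groups
 * that were loaded as little-endian 32-bit words. */
unsigned common_bytes_le(uint32_t a, uint32_t b)
{
    uint32_t diff = a ^ b;
    if (diff == 0)
        return 4;
    return ctz32_debruijn(diff) >> 3;   /* bit index to byte index */
}
```

The lookup is branch-free, which is why it can beat a chain of byte-by-byte comparisons.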

> All of this would make LZ4 slower on MIPS than on Intel, therefore the speed of LZ4 may be equal to, or worse than, Deflate on MIPS.

It sure makes LZ4 slower on MIPS, but there is no reason to believe other algorithms remain unaffected by the same side-effects of MIPS processors.
From a comparison standpoint, there is no better way to prove which algorithm is faster than by testing them.

From my perspective, it's unlikely that LZ4 and Deflate get the "same speed". Very unlikely. Deflate is much more complex, due to its Huffman processing and a more memory-hungry hash-chain methodology. This complexity is what makes Deflate compress more than LZ4. But not "faster".

BassAceGold · May 27, 2013

Compressing/Decompressing frames would be an SDK improvement rather than strictly improving this emulator, which is what Nebuleon has primarily taken upon himself to do. That isn't to say he couldn't implement an improved version of the SDK in this emulator, however, modifying the SDK is outside the scope of this project.

If anyone wants to add such a feature to the SDK, they are free to do so. It should remain a separate task from improving the emulator so that when someone eventually does optimize the SDK, the emulator (and many other DS2 projects) would receive a "free" performance gain with a recompile.

Deleted member 319809 · May 27, 2013

BassAceGold said:
Compressing/Decompressing frames would be an SDK improvement rather than strictly improving this emulator, which is what Nebuleon has primarily taken upon himself to do. That isn't to say he couldn't implement an improved version of the SDK in this emulator, however, modifying the SDK is outside the scope of this project.

If anyone wants to add such a feature to the SDK, they are free to do so. It should remain a separate task from improving the emulator so that when someone eventually does optimize the SDK, the emulator (and many other DS2 projects) would receive a "free" performance gain with a recompile.

This is why I kind of wanted TempGBA issue #18 "Save states are written with bad/corrupt short filenames." to be filed against the SDK itself, but there's no real bug tracker for the SDK itself. So I accepted it as a TempGBA bug, even though it's an SDK bug.

Cyan4973 said:
I would like to give some details on a few claims on LZ4 source code.

> To be effective, the compression needs to use more memory.
> In this comment, one can see that the choice of compression buffer size greatly influences the speed. The data cache on MIPS is 8 KiB, and there's no L2 like on Intel.

That's correct, and that's also why the memory parameter can be modified.
For a CPU with a data cache of 8 KB, I would recommend using only 4 KB for the hash table, hence the parameter:
#define MEMORY_USAGE 12

Note that the cache effect is obviously not specific to LZ4; any algorithm has to face it.
At least here, the opportunity exists to clearly tune memory requirement to cache setup.

I agree with you here. The deflate algorithm does scale down in memory usage when you use its lower compression levels, but the mapping between compression level and buffer size is not clear.

Cyan4973 said:
> MIPS cannot do unaligned access. As the algorithm requires writing 16-bit and 32-bit values to memory at arbitrary addresses, this will slow down on MIPS.

That's correct.
Here too, the slowdown applies to every other algorithm. There is no reason why LZ4 should be more affected than other algorithms.
Note that LZ4 has been specifically tested and validated on align-only CPUs, and has been reported to work well.

I don't know enough about deflate's internals to refute this properly. However, deflate only uses bit manipulation and byte writes (edit: when writing the output stream), which don't care about alignment at all.
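
To make the alignment point concrete, here is a minimal sketch of writing and reading a 32-bit value at an arbitrary address using only byte accesses. This is not code from LZ4 or from any deflate implementation; the function names are hypothetical, and little-endian byte order is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 32-bit value at any address, least significant byte first.
   Byte stores have no alignment requirement. */
void store_u32_le(uint8_t *dst, uint32_t value)
{
    assert(dst != NULL);
    dst[0] = (uint8_t)(value);
    dst[1] = (uint8_t)(value >> 8);
    dst[2] = (uint8_t)(value >> 16);
    dst[3] = (uint8_t)(value >> 24);
}

/* Load a 32-bit value from any address, least significant byte first. */
uint32_t load_u32_le(const uint8_t *src)
{
    assert(src != NULL);
    return (uint32_t)src[0]
         | ((uint32_t)src[1] << 8)
         | ((uint32_t)src[2] << 16)
         | ((uint32_t)src[3] << 24);
}
```

The cost is four byte accesses plus shifts where an aligned access would need a single instruction, which is the slowdown being discussed.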

Cyan4973 said:
> Branch Likely and Branch Unlikely are great on X86 because of the branch predictor gates, but deprecated on MIPS.

Branch likely and unlikely have a very small impact on performance, typically within 1% each, so they should not pose any real issue.
They are also easy to disable. That's already done automatically for MSVC compilers. Just extend the define to do it all the time should you want to remove them.

Very well - if you say so. (It's worth noting that the DSTwo SDK uses 'gcc', not Visual Studio.)
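
For reference, a minimal sketch of what "extending the define" could look like under gcc. The DISABLE_BRANCH_HINTS switch and the example function are hypothetical and are not LZ4's actual macro layout; __builtin_expect is the real gcc built-in.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical switch: define DISABLE_BRANCH_HINTS to drop the hints everywhere. */
#if defined(__GNUC__) && !defined(DISABLE_BRANCH_HINTS)
#  define likely(expr)   __builtin_expect(!!(expr), 1)
#  define unlikely(expr) __builtin_expect(!!(expr), 0)
#else
#  define likely(expr)   (expr)
#  define unlikely(expr) (expr)
#endif

/* Example use: the hint may change code layout, never the result. */
size_t count_nonzero(const unsigned char *p, size_t n)
{
    size_t count = 0;
    size_t i;
    assert(p != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (unlikely(p[i] == 0))
            continue;
        count++;
    }
    return count;
}
```

Building with -DDISABLE_BRANCH_HINTS turns both macros into plain expressions, so the two variants can be timed against each other on the target.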

Cyan4973 said:
> The '1'-bit count instruction on X86, POPCNT, is great. However, it doesn't exist on MIPS.

That's correct.
It's also why there is a custom software bit count method to replace it.
As far as I can tell, it works pretty well. In tests, it proved faster than the easier "compare & branches" method.

Now that I examine the code a bit more closely, it does cleverly use bit twiddling, and should run in 6 instructions per bit-count in software mode: http://code.google.com/p/lz4/source/browse/trunk/lz4.c#317

return DeBruijnBytePos[((U32)((val & -(S32)val) * 0x077CB531U)) >> 27];

where DeBruijnBytePos = loading the address of the array into a register
[] = indexing it
&, -, * and >> are basic operations
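
Here is a self-contained sketch of that expression for experimenting outside of LZ4. The function name is mine, and the table is the standard De Bruijn bit-position table for 0x077CB531 with each entry divided by 8, so treat it as an illustration rather than a copy of lz4.c. For a non-zero 32-bit value, it returns the index (0 to 3) of the byte containing the lowest set bit.

```c
#include <assert.h>
#include <stdint.h>

/* Byte index (0..3) of the lowest set bit, indexed by the top 5 bits of
   the multiply below. Each entry is a bit position (0..31) divided by 8. */
static const int DeBruijnBytePos[32] = {
    0, 0, 3, 0, 3, 1, 3, 0, 3, 2, 2, 1, 3, 2, 0, 1,
    3, 3, 1, 2, 2, 2, 2, 0, 3, 1, 2, 0, 1, 0, 1, 1
};

int lowest_set_byte(uint32_t val)
{
    uint32_t lowest_bit;
    assert(val != 0);                 /* no set bit to find in zero */
    lowest_bit = val & (0u - val);    /* isolate the lowest set bit */
    /* Multiplying by the De Bruijn constant puts a unique 5-bit pattern
       in the top bits for each of the 32 possible single-bit inputs. */
    return DeBruijnBytePos[(uint32_t)(lowest_bit * 0x077CB531u) >> 27];
}
```

For example, lowest_set_byte(0x00000100) gives 1, because the lowest set bit is bit 8, which lives in byte 1.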

Cyan4973 said:
> All of this would make LZ4 slower on MIPS than on Intel, therefore the speed of LZ4 may be equal to, or worse than, Deflate on MIPS.

It sure makes LZ4 slower on MIPS, but there is no reason to believe other algorithms remain unaffected by the same side-effects of MIPS processors.
From a comparison standpoint, there is no better way to prove which algorithm is faster than by testing them.

From my perspective, it's unlikely that LZ4 and Deflate get the "same speed". Very unlikely. Deflate is much more complex, due to its Huffman processing and a more memory-hungry hash-chain methodology. This complexity is what makes Deflate compress more than LZ4. But not "faster".

I respectfully disagree. The Huffman tree population is not that hard on the cache, due to the symbols with the highest frequencies being constantly read as they're already in the cache. And in this application, we do not need speed above all; we would need a balance of speed and compression. The input is also very well suited for compression with Deflate: large runs of identical colours. Hell, we could just do RLE.

I refer to one of the golden rules of optimisation: Know your input.
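
To show how little code the RLE option would take, here is a minimal sketch. It assumes the frame is an array of 16-bit pixels and encodes it as (run length, value) pairs; the function name and output format are made up for illustration and are not part of the DSTwo SDK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode src[0..n) as (run length, value) pairs of 16-bit words.
   dst needs room for 2 * n words in the worst case (no repeats at all).
   Returns the number of 16-bit words written to dst. */
size_t rle16_encode(const uint16_t *src, size_t n, uint16_t *dst)
{
    size_t out = 0;
    size_t i = 0;
    assert(src != NULL && dst != NULL);
    while (i < n) {
        uint16_t value = src[i];
        uint16_t run = 1;
        /* Extend the run while the colour repeats and the count still fits. */
        while (i + run < n && src[i + run] == value && run < UINT16_MAX)
            run++;
        dst[out++] = run;
        dst[out++] = value;
        i += run;
    }
    return out;
}
```

On a frame with large runs of identical colours this shrinks the data considerably, but a frame with no repeated neighbours doubles in size, so it only pays off if the input really looks like that.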

kane159 · May 28, 2013

Hello, I have two games that get stuck at the first screen, but they work fine in the official DSTwo GBA emulator. Should I upload them here, or is there some place or email address I can send them to?
Thanks
