I have pseudocode for experimental file compression.

Discussion in 'General Off-Topic Chat' started by SoraK05, Mar 18, 2012.

  1. SoraK05
    OP

    Member SoraK05 GBAtemp Regular

    Joined:
    Aug 24, 2007
    Messages:
    148
    Country:
    Kenya
    I've been thinking about file compression and came up with my own method.

    My method gives the computer all the information it needs to reconstruct the rest on its own. Any file at all can be converted to about 200 bytes. To get back to the original file, the computer has to work through a range of combinations that is measurable per file; by measuring CPU instructions, a timer for the maximum time needed to go through every combination in that file's range would give an idea of how long to wait.

    I don't know how long it will take to decompress - to get the original file back from the 200-byte file. I need some help making the program so I can experiment with different file sizes and test splitting the file into chunks during the compression stage (about 150 bytes extra per chunk), in case smaller chunks take less time in general.

    The only way I can really know is if I can get a proof-of-concept - even without the timer coded into it (although it would really help when experimenting with big files).


    Can someone help me code this?
    It basically requires some knowledge of a computer's file sequence and of hashes, plus a way to measure the maximum time it would take to work through every combination in the range, in case your file happens to be the last combination in that file's range.


    This is guaranteed to produce a very small file, but I don't know whether it is worth it in terms of how long it will take - the only way I can get a definitive answer is if someone helps me make this program.

    If you can help me out, I would really appreciate it.
    I can describe how the entire thing works in full detail if anyone is interested in helping to make it for experimentation.
     


  2. Ace

    Member Ace GBATemp's Patrick Bateman

    Joined:
    Apr 8, 2009
    Messages:
    1,035
    Location:
    Manhattan
    Country:
    Sweden
    I dunno how many advanced coders there could be here.... Maybe try your hand over at Reddit? I think more program/game devs hang around there :lol:
    It sounds like a rather interesting concept. I'd sure be up for 200 byte compression, provided it doesn't destroy my files.
     
  3. Rydian

    Member Rydian Resident Furvertâ„¢

    Joined:
    Feb 4, 2010
    Messages:
    27,883
    Location:
    Cave Entrance, Watching Cyan Write Letters
    Country:
    United States
    If it can compress files to about 200 bytes, that means it can only compress as many distinct files as can be represented in 200 bytes. That means it can only compress files up to a certain size/complexity, otherwise you'll run into collisions and then shit's all fucked and the compression's totally useless because it'll be destroying data.
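    To put rough numbers on that counting argument, a quick Python sketch (the 1 KB input size here is just an arbitrary example):

    Code:
        # Pigeonhole argument: a 200-byte output can only distinguish 256**200
        # different files, but there are far more possible 1 KB inputs, so a
        # lossless "compress anything to 200 bytes" scheme is impossible.
        outputs = 256 ** 200           # distinct 200-byte compressed files
        inputs_1kb = 256 ** 1024       # distinct 1 KB original files
        print(len(str(outputs)))       # 482 digits
        print(len(str(inputs_1kb)))    # 2467 digits -> vastly more inputs than outputs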
     
    1 person likes this.
  4. Psionic Roshambo

    Member Psionic Roshambo GBAtemp Advanced Maniac

    Joined:
    Aug 12, 2011
    Messages:
    1,707
    Country:
    United States
    I have often wondered why compression programs don't use modules - essentially large chunks of code, pre-compressed. Take the 10 GBs of data from Wii games, for example, where the compression routine could call on a pre-existing chunk of data. Of course this assumes that at least bits of code are reused across Wii games... Not sure it would work at all, but hey, it's early and I haven't had my coffee yet.

    Edit: To the OP: sorry, but this is probably the 20th time I have seen someone say "I have this new compression routine...". 200 bytes might get you a 2KB file IF everything was optimal, unless it's a dummy file. Anything useful is going to be at best 10:1... Most programs tend to hit 2:1 as a realistic ratio, depending on what's being compressed.
     
  5. SoraK05
    OP

    Member SoraK05 GBAtemp Regular

    Joined:
    Aug 24, 2007
    Messages:
    148
    Country:
    Kenya
    It is guaranteed to work so long as the computer does the brute forcing to 'get the missing information', which could take quite a while. I need to test it - if it takes more than a few days for large files, or takes silly long, then it might not be worth it. However, the created file will always be about 200b, and the file rebuilt from it is guaranteed to be 1:1 - no loss in any way.

    I've been working on many angles for a while and have come up with one that is guaranteed to work - the only question is whether it is worth it time-wise for the program to do the brute-force task (within a range calculated per file).
     
  6. Jamstruth

    Member Jamstruth Secondary Feline Anthropomorph

    Joined:
    Apr 23, 2009
    Messages:
    3,456
    Location:
    North East Scotland
    Country:
    United Kingdom
    Your "pseudocode" isn't an algorithm to do this. Its more of a vague description of what you want it to do.
    You mention brute forcing to get "missing information". So you'll be destroying large chunks of bits and bruteforcing to fill in the gap. How will the program know when it's got the right data? Brute forcing is never a good way of solving a problem. The amount of time needed increases exponentially for larger and larger bit combinations.

    If you have an algorithm then can you post it for scrutiny?
     
  7. Peps

    Member Peps GBAtemp Regular

    Joined:
    Jun 28, 2009
    Messages:
    240
    Country:
    Ireland
    If I'm understanding correctly:

    "Compressed" file contains a SHA1 (or similar) hash of the original file. It also contains additional metadata such as original file length, first few bytes of the original file, and other properties to avoid a hash collision. Your algorithm will dynamically create a new file, and for every byte change, it will hash the new file, and compare it with the original hash. If it matches then the odds are that's the original file.

    It does theoretically have a basis, but the two flaws are that it would take ages to recreate the original file, and I don't think hash collisions are that easily preventable. While the additional metadata would help tremendously in preventing a collision, it's still possible for one to happen.
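    If that reading is right, the decompressor would be something like this rough Python sketch (not the OP's actual code - the hash choice and the all-zeroes starting point are just assumptions for illustration):

    Code:
        import hashlib

        def rebuild(target_sha1_hex, length):
            """Try every file of the given length until one hashes to the stored
            SHA-1. There are 256**length candidates, so this is hopeless for
            anything beyond a handful of bytes."""
            candidate = bytearray(length)              # start from all zero bytes
            while True:
                if hashlib.sha1(candidate).hexdigest() == target_sha1_hex:
                    return bytes(candidate)            # likely the original, barring a collision
                # treat the bytes as one big base-256 counter and increment it
                for i in reversed(range(length)):
                    candidate[i] = (candidate[i] + 1) % 256
                    if candidate[i] != 0:
                        break
                else:
                    raise ValueError("exhausted every combination of this length")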
     
  8. celcodioc

    Member celcodioc Major A$$hole

    Joined:
    Nov 13, 2011
    Messages:
    278
    Country:
    Sweden
    FYI, there are 1 099 511 627 776 possible unique byte combinations in a 5 byte file. (256 ^ 5)

    A random internet calculator tells me that there are approximately this many possible byte combinations in a 2KB file:
    Warning: Spoilers inside!

    So brute forcing a 2KB file would take... well, a lot of time.
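    (For anyone who wants to check those figures, two lines of Python reproduce them - the 2KB case is shown as a digit count because the full number is enormous:)

    Code:
        print(256 ** 5)               # 1099511627776 unique 5-byte files
        print(len(str(256 ** 2048)))  # 4933 digits: the number of possible 2KB (2048-byte) files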



    EDIT: Also, I'm tired. Feel free to call me an idiot.
     
  9. SoraK05
    OP

    Member SoraK05 GBAtemp Regular

    Joined:
    Aug 24, 2007
    Messages:
    148
    Country:
    Kenya
    A hash like md5 has 256^16 combinations, as it's 16 bytes. Every 256^16 combinations, the cycle repeats again. This is the same for other hashes.

    I am working with the 'file sequence' of a file. I was experimenting with a program that takes a file and outputs its file sequence number to a txt file (in decimal).
    A 7b file has a 9b file sequence number, and it gets larger and larger as the file size increases.

    What I would do is get the MD5, SHA1 and CRC of the original file and of the file sequence number and save them down. I would then find the largest power of 256 that fits into the file sequence number.
    For a 35b file, the example was 256^37 - that means the file sequence number lies between 256^37 and 256^38. I would record 256^37 as the offset to start the process from.

    Once all the hashes are saved for the original file and the file sequence, and the largest divisible number (like 256^37 in the example) is recorded, that is all the information that needs to be saved - less than 200b.

    From that information, the search would start from the 256^X offset given and go through every combination between 256^37 and 256^38 in the example.
    For every combination, it checks the candidate file sequence number against the first byte of the saved hash - if that matches, it checks the next. It repeats the cycle until not one but all 3 hashes match the file sequence number being tested. When that happens, it builds the file from the brute-force-discovered file sequence number and checks its hashes against the original file's. If they match, it is the original file. If not, it keeps going.
    Hashes will repeat every hash cycle range, but all 3 hashes have to match, and the file created from the discovered file sequence number must match as well.

    This system guarantees that, with enough brute forcing within a range (like 256^37 - 256^38), your file will eventually be discovered and rebuilt as a 1:1 match of the original - and all of it is saved as less than 200b.


    The only thing left is knowing how long it will take; the computer can give you an estimate, based on its current calculation speed, of how long the maximum range for that particular file would take - the larger the file, the larger the range.

    If it is beneficial to use smaller chunks, this system can be used to split the original file sequence number into chunks and save the created file hashes and file sequence chunk hashes, and just run the process with these chunks.

    This method is guaranteed to recover the file eventually, within a given range, and the output will be about 200b. The only questions are how long it will take for files in general, and whether it is worth using chunks in the first place. I don't know how long it will take, but I know the method works - I will only know whether it is worth it time-wise once the program can be made and tested on different files.
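    Put as a rough Python sketch (a simplification, not the exact program - it treats the file sequence number as a plain big-endian integer, and the hash/CRC choices are just stand-ins), the decompression loop would look something like this:

    Code:
        import hashlib, zlib

        def decompress(md5_orig, sha1_orig, crc_orig, power):
            """Walk every value between 256**power and 256**(power + 1), rebuild a
            candidate file from it, and test the candidate against the saved hashes."""
            num_bytes = power + 1                      # values in this range need power+1 bytes
            for n in range(256 ** power, 256 ** (power + 1)):
                data = n.to_bytes(num_bytes, "big")    # build the candidate file
                if (zlib.crc32(data) == crc_orig                      # cheapest check first
                        and hashlib.md5(data).digest() == md5_orig
                        and hashlib.sha1(data).digest() == sha1_orig):
                    return data                        # all three match -> take it as the original
            return None

    Even written like this, the loop has to run up to 255 * 256**power times in the worst case, which is where the time estimates later in the thread come from.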
     
  10. ThatDudeWithTheFood

    Member ThatDudeWithTheFood TRIANGLEZ

    Joined:
    Mar 9, 2009
    Messages:
    2,198
    Location:
    Illuminati
    Country:
    United States
    That is for only one file, but when you're compressing things it's about size and also neatness. How long would it take to do that with multiple files?
     
  11. SoraK05
    OP

    Member SoraK05 GBAtemp Regular

    Joined:
    Aug 24, 2007
    Messages:
    148
    Country:
    Kenya
    It's meant to be done for one file at a time. You could do it to a RAR containing multiple files, for example. :)
     
  12. ThatDudeWithTheFood

    Member ThatDudeWithTheFood TRIANGLEZ

    Joined:
    Mar 9, 2009
    Messages:
    2,198
    Location:
    Illuminati
    Country:
    United States
    Yes, but wouldn't that defeat the purpose?
    Once you RAR a file you have a large compressed .rar file, which you would then have to unrar.
    The point of compression is to compress files efficiently, and having to decompress twice is very inefficient.
     
  13. SoraK05
    OP

    Member SoraK05 GBAtemp Regular

    Joined:
    Aug 24, 2007
    Messages:
    148
    Country:
    Kenya
    It basically turns any file into about 200b. 4GB, 8GB, 50GB - it doesn't matter; generally speaking, the output will be 200b. So if you apply it to a RAR containing many files, it can work that way. It only handles one file at a time - it will take long enough as it is, and multiple files would be insane.
    It works best with a single file, so use an ISO, a RAR with stuff in it, etc.
     
  14. ThatDudeWithTheFood

    Member ThatDudeWithTheFood TRIANGLEZ

    Joined:
    Mar 9, 2009
    Messages:
    2,198
    Location:
    Illuminati
    Country:
    United States
    Yes, but the bigger the RAR, the more pointless it is.
    If you have a 4GB RAR file and you shrink it to 200b, once you expand it you still have to decompress the RAR too.
     
  15. SifJar

    Member SifJar Not a pirate

    Joined:
    Apr 4, 2009
    Messages:
    6,022
    Country:
    United Kingdom
    But it saves a lot of downloading if you're downloading a file. If such an algorithm was implemented within an existing archive program, it could automatically extract the contents of the RAR or whatever after decompressing from the 200b. Or you could use ZIP instead of RAR and then at least Windows users (dunno about Mac, Linux probably depends on distro) wouldn't have to extract the ZIP, because Windows can treat ZIPs as a regular folder as far as the user is concerned.

    I think I understand what you are talking about for the most part, and, to me at least, it seems like a decent enough idea. Sadly, I don't know enough programming to be able to help. But I do have a few questions/suggestions. First off, what is the file sequence number? Second, would it not help to record the size of the original file, as well as the hashes of the file + sequence number? So that the program knows how long to make the file to rebuild? I figure that should cut down on the possibilities.

    But I think it will still take a huge amount of time, unless you were to split the original file into smaller segments. There is an insanely huge number of possible 4GB (for example) files. Trying each one until you reach the right one will take ages. It'll probably take a few seconds/minutes to generate each attempt with that many bytes, and with the huge number of attempts required it could potentially take days (EDIT: billions of years) to create the right file. (This is a guess, but I will perhaps try to throw together a test to work out roughly how long it takes to generate a file, which would let me calculate a rough figure.)

    EDIT: I used the following AutoHotKey script (this won't be as optimised as a C program, but it gives a rough idea) to generate a 100KB file (all bytes "ff" for the record):

    Warning: Spoilers inside!
    It took 446 seconds. Bear in mind that there are 256^102400 (5.921720884 x 10 ^ 246603 according to http://www.alpertron....ar/BIGCALC.HTM) possibilities for this 100KB file. Each attempt would take somewhere around that time. One day is 86400 seconds. That is enough for 193 combinations. Round that up to 200 and you're still not even remotely close to the number of possible files. It would still take more than 2.960860442 x 10 ^ 246601 days, which works out to roughly 8.1 x 10 ^ 246598 years. Doesn't seem worth it for a 100 KB file. And it gets exponentially worse as file size increases, because the number of combinations is 256 to the power of the number of bytes. (Or, if you're mathematical, the log of the number of combinations is directly proportional to the number of bytes in the file.) So for a 50GB file like you mentioned above, it would take considerably longer than this, not "just" 500 times longer.

    Note also that this does not take into account any actual generation of bytes. It is hard coded to use FF for every single byte.
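    (The AutoHotKey script itself is in the spoiler above; a rough Python equivalent of the same experiment - write 102,400 hard-coded 0xFF bytes and time it - would be:)

    Code:
        import time

        start = time.time()
        with open("test.bin", "wb") as f:
            for _ in range(102400):       # 100KB, one hard-coded 0xff byte at a time
                f.write(b"\xff")
        print(time.time() - start, "seconds to generate one candidate file")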

    In short, sadly, I would say your concept is not feasible. I wish it were (I have wondered about the possibility of combining different hashes to create an extremely compressed file myself, but never really worked out the logistics of how it would actually work), but it's not. Sorry to be the bearer of bad news.

    Perhaps when we have quantum computers.
     
    1 person likes this.
  16. smealum

    Member smealum growing up sucks.

    Joined:
    May 1, 2006
    Messages:
    627
    Location:
    SF
    Country:
    United States
    You're being way too optimistic. I think Rydian's already been over this in another thread, but let's see anyway. 4GB is about 4*8*1024*1024*1024 bits (I guess technically that would be 4GiB but that's beside the point). That means that there are 2^(4*8*1024*1024*1024) possible 4GB files. According to Wolfram, that's about 10^(10^10); for those who don't know, that's a 1 with 10 billion 0s following it. So yeah, it's a lot. So much in fact that even if generating and testing a file were to take, say, only 1 second (which given real world read/write rates is impossibly optimistic), it would still take much longer than the amount of time the universe has been around to test all possibilities.
    I don't think SoraK05's algorithm is purely brute force, though. I didn't really understand his explanation, so I don't really know. But at one point he speaks of only checking files within a certain range? So for the sake of argument, let's take a look at this "256^37 - 256^38" range. According to Google, 256^38 - 256^37 is roughly 10^91. That's a considerably lower number than before, but it's still pretty big. Now let's say checking whether each "file sequence number" is correct takes just 1ms; in the worst case it would take about (10^91)ms to find the file. Unfortunately, that's still about 10^80 years - which, again, is considerably longer than the age of the universe (about 10^10 years).
    So again, I didn't really understand what he meant, so my calculations could be irrelevant. Despite this though, I feel pretty confident when I say that there's no way this could be viable.
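    (The logarithms behind those figures are easy to check in Python; the 1ms-per-test figure is the same assumption as above:)

    Code:
        import math
        bits = 4 * 8 * 1024**3                # bits in a 4 GiB file
        print(bits * math.log10(2))           # ~1.03e10, so 2**bits is about 10**(10**10) possible files
        print(math.log10(256**38 - 256**37))  # ~91.5, i.e. roughly 10**91 values in that range
        print(10**91 / 1000 / 86400 / 365)    # at 1ms per check: ~3e80 years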

    Edit: oh come on SifJar, don't ninja me like that.
     
  17. SoraK05
    OP

    Member SoraK05 GBAtemp Regular

    Joined:
    Aug 24, 2007
    Messages:
    148
    Country:
    Kenya
    Each file in a computer can be translated to its file sequence number.

    If a file is FF, it means it is the 256th combination in the file sequence. FF FF = 65535.. It is the '65535th file' in the file sequence..

    Here is an example.
    A 35b text file containing this:
    "This is a test example of 35 bytes."
    OR (as hex)
    5468697320697320612074657374206578616D706C65206F662033352062797465732E

    Becomes this (as decimal):
    193628782805444344641702109281169662577918862165681864394092089254307819541120320003044197
    OR (as hex)
    1855787F2B1971FDB75E33697C178BE0AC0B68ED8B6248AF3BD088267128CDE82204B52F365

    That is the file sequence number of the 35b file.
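    (A simple way to picture this in Python is to read the file's bytes as one big base-256 integer; note that this plain encoding is only an approximation of the scheme and won't reproduce the exact figures above:)

    Code:
        data = b"This is a test example of 35 bytes."
        n = int.from_bytes(data, "big")   # the file's bytes read as one big base-256 number
        print(n)                          # an 84-digit integer for this 35-byte file
        print(hex(n))                     # 0x5468697320... (the hex of the file itself)

        # and back again: the integer plus the length is enough to rebuild the file
        assert n.to_bytes(len(data), "big") == data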


    With a 7b file it is pointless, but for any file above 200b it will do it justice and bring it down to ~200b.

    This is the best compression I have come up with using the file sequence number of a file as well as the original file.

    You take the offset closest to the file sequence number - i.e. 256 to the highest power that fits into it.
    In the example here, that is 256^37, as the file sequence number lies between 256^37 and 256^38.
    If the file were larger you might set it to 256^2000, say, and do the calculations between 256^2000 and 256^2001 - that is the maximum range.

    The larger the file, the larger the range - i.e. the gap from the offset to the next power is larger.
    256^37 - 256^38 is much smaller than 256^2000 - 256^2001... Either way, it works from the offset you give it and processes until it reaches the next power.

    The only problem is that doing the calculations between 256^2000 and 256^2001 could take a very long time.
    Extra bytes could be stored to bring the starting point closer to the actual value within 256^37 and 256^38, for example, but that takes a whole lot more space and makes the file almost as large as, or larger than, the original.

    The only way to keep it as small as possible is just to record the offset alone and the hashes, then let the computer brute force.

    Only issue here is time - the rest works just fine.
    In theory, let's take 256^37 and 256^38.
    If you get the difference between them, it is:
    32465260872830871999903353159024464441757777524050858994513488411585451964978178754797895680.
    If it could do 1 billion calculations in a second, it would mean this would take:
    375755334176283240739622143044264634742566869491329386510572782541498286631691 DAYS...
    hrm..
    I guess if it would take that long with a simple estimate - and that's only for the 35b example - then it's totally pointless..

    That's using 1 billion calculations per second as an example, which is a moderate figure.
    I guess I just showed myself that it would definitely not be feasible..
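    (The quick sums, for anyone who wants to reproduce them:)

    Code:
        gap = 256**38 - 256**37            # size of the search range in the example
        per_second = 10**9                 # assume 1 billion checks per second
        print(gap)                         # the 92-digit figure above
        print(gap // per_second // 86400)  # ~3.76 x 10**77 days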

    It would work, but take forever :'(

    I suppose this could only work with computers hundreds of years away, using quantum computing or light or something - if the speed were much, much faster, this might be more feasible..

    I suppose I should have done this theoretical time test myself. :) :)
     
  18. SifJar

    Member SifJar Not a pirate

    Joined:
    Apr 4, 2009
    Messages:
    6,022
    Country:
    United Kingdom
    :ninja:

    And yeah, I don't fully understand what he means either. Maybe it isn't a straight brute force, in which case my post isn't entirely applicable, but it is still sort of applicable I reckon. I don't understand what he means by "file sequence number", and google doesn't help much on that either.

    EDIT: OK, bit of an explanation of file sequence number there. I'll try and do some more calculations to apply that to my previous calculations. So "file sequence number" = "sum of all bytes in file" then. I guess that changes things.

    EDIT: OK, so in the region of 100KB-101KB files, by my calculations, there are 1.51 x 10 ^ 246606 (256 ^ 102401 - 256 ^ 102400; is this right?) valid sequence numbers, each corresponding to a different file (well, some may correspond to the same file - multiple files could have the same sequence number [e.g. ff 05 ff would have the same sequence no. as 05 ff ff]). I don't think there is any way of knowing the length of the sequence number (unless that is also recorded), so you'd have to start from 1 and work your way up. So you'd have to try more combinations than that number, I think. If you have the correct length stored, you should be able to find the number within that many combinations, I guess. But even if you could generate and test them at a rate of 1 per second, it'd take roughly 1.7 x 10 ^ 246601 days to test the sequence numbers, before you even get onto generating and testing the file itself. Now if the sequence number is closer to 256 ^ 102400 than 256 ^ 102401, it'll take less time than that. But it could easily be 1 less than 256 ^ 102401 and therefore take the full time.
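    (Checking that count with logarithms rather than writing the whole number out:)

    Code:
        import math
        # how many values lie between 256**102400 and 256**102401
        log10_count = 102400 * math.log10(256) + math.log10(255)
        print(log10_count)    # ~246606.18, i.e. about 1.5 x 10**246606 sequence numbers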
     
  19. FAST6191

    Reporter FAST6191 Techromancer

    pip
    Joined:
    Nov 21, 2005
    Messages:
    21,716
    Country:
    United Kingdom
    You seem to be proposing a combination of a premade dictionary (think rainbow tables, but for file fragments) and file recovery (although you did not elaborate, I would look to the oversampling/parity world - http://parchive.sourceforge.net/ and the whole par2 scene, or maybe even some of the RAID recovery techniques). Each has been experimented with in the past, but the amount of processing required for what you want would make it unpleasant even on a modern machine.

    Unless you have a specific application where it works well (someone mentioned the Wii earlier - despite it being a classic example of encryption making for high-entropy files, it might do well for scrubbed code, with recovery of the optional scrubbed data; you can always generate the files needed to run the game and then spend time regenerating the padding) or want to do it as a learning/demonstration project (it would make a great concept for a university-level project), I would not really push it much further.
     
    1 person likes this.
  20. SifJar

    Member SifJar Not a pirate

    Joined:
    Apr 4, 2009
    Messages:
    6,022
    Country:
    United Kingdom
    After writing the script I did earlier, I have partially convinced myself I might be able to write a rubbish PoC in AHK (obviously it'll be very slow compared with native optimised C code or whatever), and even though it will be useless to almost everyone, I may well do so if I can find the time over the next few weeks. No promises I will ever release anything. It probably won't ever reach the stage of actually working (even ignoring the fact it would take billions of years for files over a few hundred bytes - I may also take some time tomorrow and calculate what size of file I could do in a reasonable time). Just thought I'd let you all know that.

    EDIT: Thinking about it, I don't think I could do the file sequence number stuff. Working out the number would be easy enough, but I have no idea how I would go about generating a new file with that same sequence number without generating every possible file and checking the file sequence number, at which point I could just check the hashes of the file instead (which I would have to do after the sequence number anyway). But I might do it with just the hashes of the file. And file length.
     
