I have pseudocode for experimental file compression.

Rydian · Mar 18, 2012

SoraK05 said:
It is guaranteed to work so long as the computer does the brute forcing to 'get the missing information', which could take quite a while.. I need to test - if it takes more than a few days for large files, or takes silly long, then it might not worth it.. however, the file created will always be about 200b and the created file from it will be 1:1 guaranteed - no loss in any way..

I'm sorry, but this idea is bullshit. It cannot work in a reasonable manner. I'm not even talking time. I'm talking that there's a limit on the files you can compress that low without killing the data.

If you have 200 bytes to hold files, then you have a finite limit for the number of files that can be compressed. What happens if you get two files that compress down to the same sequence of bytes? Well, then there's obviously data being destroyed if that happens! There's no way to tell which two versions of a file went into the original one. You're probably thinking along the lines of a hashing function as far as crunching down to a small number of bits that represent a large number, but hashes (by their nature) cannot be reversed.

Here's an example about collision. Two strings, which, when run through MD5, produce the exact same output. Given only the output, how in the hell do you expect the program to know which was the right input? There's no way, given only the 200 bytes of input in your case, if a collision happens.

This is why your statement of everything going down to 200 bytes is not possible. There's a finite number of outputs, and the moment two inputs produce the same output, your compression has destroyed data and is useless.

Compression is good and all, but you need to be able to uncompress too!

SifJar · Mar 18, 2012

The way I see it, he is basically suggested taking several different hashes, generating a new file and bruteforcing until that file matches all the hashes of the original. While there can be collisions in particular hashes, it is, I think, unlikely that any given file will have ALL hashes equal.

It's not a case of recreating the files from the hashes, it's more a case of creating new files until they have the same hashes (and same "sequence number"). I think all those things combined will make it impossible/extremely difficult to have two files compressed to the same 200 byte file.

Thinking about it more, I am not sure if I will be able to create anything or not, I am not sure how I could approach generating the bytes for each file to loop through every possibility. But I will think about it. Now that the idea has come to my mind, I really want to try it, even though I know it'll end up be fairly useless.

Rydian · Mar 18, 2012

If the hashes need to be included with the original file, then that's 200 bytes + the size of the hashes, which is no longer 200 bytes total.

SifJar · Mar 18, 2012

The hashes ARE the 200 byte file.

elenar · Mar 18, 2012

You're trying to use hashes as compression.

It _sounds_ suspiciously like you're trying to reverse engineer a file from its MD5 sum. This can't be done realistically, and probably can't be done literally due to Conservation of Information.

There's complex high-level physics about how much information can be removed while maintaining the ability to reconstruct the original data from the garbage. Diminishing returns on data fidelity after reduction of data density would prevent any such compression algorithm from being able to be decompressed successfully without a large chance of generating garbage data.If you just want to compress a file to its minimum possible size, then make a hash and then store the file on a seperate drive, and "decompress" by copying the file that matches that hash back to the operative drive. This would be a cheaper, faster, and more secure (the second drive could be encrypted physically) method of compression. I know, you're saying "that's not compression", but in terms of cost and effectiveness, it's a superior method to accomplish the same result you're trying to achieve.

Not to mention "deletion." It will reduce the effective file size to zero, and would still have a higher likelihood of successful recovery than the method you describe.

Even in a world where we have computers that can do these things, if a file is of sufficient size and complexity, those 1600 bits could only hold so many different combinations. When trying to compress a series of seemingly random data (much data is seemingly random to begin with) you'd hit the limit of information that those 1600 bits could hold rapidly. If your file is 1GB of particle collision data information in all it's Brownian glory, there's no way (conservation of information) that your 1600 bits can contain all the data necessary to rebuild nearly 9 billion bits of semi-random data.

That's an extreme example, there are many other more realistic examples that show the inability of your method to be functional.

Also, I'm serious. It sounds _very_ much like you're trying to engineer a way to rebuild files from hashes. I can think of functionally zero legitimate uses of that tech. I'd be a bit more careful, if I were you.

Rydian · Mar 19, 2012

SifJar said:
The hashes ARE the 200 byte file.

As pointed out, if it's just hashes, then this idea isn't happening.

gshock · Mar 19, 2012

SifJar said:
It took 446 seconds. Bear in mind that there are 256^102400 (5.921720884 x 10 ^ 246603 according to http://www.alpertron....ar/BIGCALC.HTM) possibilities for this 100KB file . Each attempt would take somewhere around that time. One day is 86400 seconds. That is enough for 193 combinations. Round that up to 200 and you're still not even remotely close to the number of possible files. It would still take more than 2.960860442 x 10 ^ 246601 days. Which just so happens to be 2.960860442 x 10 ^ 246598 years. Doesn't seem worth it for a 100 KB file. And with file size increase, it will increase exponentially because the number of combinations is dependent on 256 to the power of the number of bytes. (Or if you're mathematical, the log of the number of combinations is directly proportional to the number of bytes in the file). So for a 50GB file like you mentioned above, it would take considerably longer than this, not "just" 500 times longer.

Note also that this does not take into account any actual generation of bytes. It is hard coded to use FF for every single byte.

Shouldve used ASM or C++.

Should've not used the file system ( until a hash is found ) because it represents a bottleneck. Instead just allocate memory and for each new combination change a bit or a byte and recalcualte the next hash.

Rydian said:
SoraK05 said:

It is guaranteed to work so long as the computer does the brute forcing to 'get the missing information', which could take quite a while.. I need to test - if it takes more than a few days for large files, or takes silly long, then it might not worth it.. however, the file created will always be about 200b and the created file from it will be 1:1 guaranteed - no loss in any way..

Click to expand...

I'm sorry, but this idea is bullshit. It cannot work in a reasonable manner. I'm not even talking time. I'm talking that there's a limit on the files you can compress that low without killing the data.

If you have 200 bytes to hold files, then you have a finite limit for the number of files that can be compressed. What happens if you get two files that compress down to the same sequence of bytes? Well, then there's obviously data being destroyed if that happens! There's no way to tell which two versions of a file went into the original one. You're probably thinking along the lines of a hashing function as far as crunching down to a small number of bits that represent a large number, but hashes (by their nature) cannot be reversed.

Here's an example about collision. Two strings, which, when run through MD5, produce the exact same output. Given only the output, how in the hell do you expect the program to know which was the right input? There's no way, given only the 200 bytes of input in your case, if a collision happens.

This is why your statement of everything going down to 200 bytes is not possible. There's a finite number of outputs, and the moment two inputs produce the same output, your compression has destroyed data and is useless.

Compression is good and all, but you need to be able to uncompress too!

It's not really compression or compressing anything. The data is discarded and the hash is matched with a bruteforced pattern.

You'd probably have to make your own hashing algorithm that was significantly (A) faster to calculate (B) predictable, in terms of how the hashes correlated with data used to get the hash.

I think you could get around the finite-key-space problem in other ways but brute-force isn't ever going to be realistic.
( for reconstructing unknown data. )

Blood Fetish · Mar 19, 2012

Sorry I came into this thread late. It is mathematically impossible to perform lossless compression on random data. It sounds like that is exactly what your pseudo code would hypothetically do.

SoraK05 · Mar 19, 2012

The file sequence number is basically the file itself - it contains all its data. There cannot be 2 files of the exact same bytes - this is what makes the file unique - it's file sequence number. Even one bit difference will change the file sequence number by 1.
The idea is to get hashes of the original file and hashes of the file sequence number. Then, get an offset of 256^X closest to the file sequence number itself, and use that offset to start brute forcing. Check the file sequence incrementally for its hashes to see if they match, and keep going until all 3 do. The result should the unique file sequence number of the original file and can be used to rebuild it 1:1. If the hashes all match, it is the original file, and if not, continue brute forcing the file sequence until all 3 hashes of it match again.

Using this method, it is guaranteed to eventually rebuild the file 1:1 perfectly. No loss whatsoever - the idea is to brute force to eventually reach the file sequence number, without having to save the entire number as it is very long. The file sequence number will be reached, and building a file with it is in essence building the original file 1:1.

All you need ideally are just the hashes to make it eventually work. The offset speeds it up.
I don't think it's possible that the 'wrong' matched hashes during the brute force sequence number process, even if the number that matches it is in hash cycles previous to the area where the number we are looking in the file sequence for is, will build the wrong file, since the file being built will be checked against the hashes of the original file. They have to match 1:1.

I'm not sure if it is possible that a the wrong match of the file sequence number and the original file can happen at the same time :S

elenar · Mar 19, 2012

SoraK05 said:
The file sequence number is basically the file itself - it contains all its data. There cannot be 2 files of the exact same bytes - this is what makes the file unique - it's file sequence number. Even one bit difference will change the file sequence number by 1.
The idea is to get hashes of the original file and hashes of the file sequence number. Then, get an offset of 256^X closest to the file sequence number itself, and use that offset to start brute forcing. Check the file sequence incrementally for its hashes to see if they match, and keep going until all 3 do. The result should the unique file sequence number of the original file and can be used to rebuild it 1:1. If the hashes all match, it is the original file, and if not, continue brute forcing the file sequence until all 3 hashes of it match again.

Using this method, it is guaranteed to eventually rebuild the file 1:1 perfectly. No loss whatsoever - the idea is to brute force to eventually reach the file sequence number, without having to save the entire number as it is very long. The file sequence number will be reached, and building a file with it is in essence building the original file 1:1.

All you need ideally are just the hashes to make it eventually work. The offset speeds it up.
I don't think it's possible that the 'wrong' matched hashes during the brute force sequence number process, even if the number that matches it is in hash cycles previous to the area where the number we are looking in the file sequence for is, will build the wrong file, since the file being built will be checked against the hashes of the original file. They have to match 1:1.

I'm not sure if it is possible that a the wrong match of the file sequence number and the original file can happen at the same time :S

The bolded part is why this can never work.

Big enough file? Compressed size no longer fits in 200B.

200B pointer for the decompression can only have so much compressed data in it. if file "A" is 200 bytes of zeros, "B" is 200 bytes of 1, etc. You got millions of possibilities, sure. That's not enough. I've already explained why, but you keep insisting that you can reduce anything to 200B. My question is, I give you a file that's 201 bytes of random, non-sequential, unpatterned data. How do you make it into 200 bytes? What's your plan? How do you compress just that one extra byte of random, non-patterned data?

If you'd consider it, or if you have any programming experience of your own (sorry to pull that card, but I want you to understand this), you'd see that you can't just lose it. I'm fairly certain that you don't actually have any real idea as to what the data structure of a "file" is.

Do yourself a favor, do some reading about data storage and file archiving. You might have a feasible good idea about compression and change the internet. Right now, you're just arm-waving. It's always a fun activity, but this time your concept is inherently flawed. Just think of it as having been a fun learning exercise.

SifJar · Mar 19, 2012

Rydian said:
SifJar said:

The hashes ARE the 200 byte file.

Click to expand...

As pointed out, if it's just hashes, then this idea isn't happening.

Quite so, but I think perhaps not for the reasons you think. He is not suggesting at all that you can reconstruct a file from it's hashes. His idea is just to record the hashes and then generate new files until one of them matches all the hashes. In this case, you will get an exact copy of the original file. The reason it won't work is that for files over a few KB it will take years to stumble upon the correct file. Billions of years for over a few hundred KB. As for a few MB or GB, probably not even worth considering.

@[member='elenar']:

He is not suggesting truly compressing a file to 200B. It is more a case of representing a file in 200B. If you have a few different hashes for a file, together they will likely only correspond to one file. You cannot hope to reconstruct said file from those hashes, but you can keep making new files and testing them against those hashes. If it matches, it'll be the same file. The data from the original file is not truly compressed, it is just being represented by a few properties of the original data. The only reason I can see why this is not feasible is because of the mammoth amount of time it would take to find the correct file that matches the hashes.

Rydian · Mar 19, 2012

If more than one input matches the hashes, than there's no way to tell if the first generated file that matches is the correct one.

SifJar · Mar 19, 2012

Rydian said:
If more than one input matches the hashes, than there's no way to tell if the first generated file that matches is the correct one.

Yes, but with multiple hashes collisions are unlikely. If one hash collides on a particular file, chances are at least one of the others won't. For example, say file A and file B have the same MD5 - it is highly unlikely they will also have the same SHA-1. And it'd have to match ALL the hashes to be accepted as valid.

Rydian · Mar 19, 2012

But if it's not made entirely of hashes but is still 200 bytes then even less of those bytes can be used for the final data, and the hashes will be limited as well. That would be more along the lines of "Generated these 10 pre-made hashes" than some new compression idea.

SifJar · Mar 19, 2012

I don't really understand what you're saying but I have already said several times that this is not really compression.

elenar · Mar 19, 2012

SifJar said:
Rydian said:

If more than one input matches the hashes, than there's no way to tell if the first generated file that matches is the correct one.

Click to expand...

Yes, but with multiple hashes collisions are unlikely. If one hash collides on a particular file, chances are at least one of the others won't. For example, say file A and file B have the same MD5 - it is highly unlikely they will also have the same SHA-1. And it'd have to match ALL the hashes to be accepted as valid.

Accepting as a theoretical truth that you could in fact do this (it's not possible, but fine, we'll assume that it is) generating a 500MB file full of random data takes a long time, just based on write speeds alone. The likelihood that your data will be correct is abysmal. It would take centuries, even if you could write rapidly, in most cases. There's no way that this process would be cost-effective in terms of either storage, energy use, time, or any other feasible rubric.

As a proof-of-concept alone, fine. Yes, it's true that if one million monkeys typed away at one million keyboards for one million years, they'd probably generate the data needed for your file, and yes, the method that you're suggesting could be used to easily determine when the monkeys had finally succeeded.

What you're talking about is "how to make a perfect file identification schema that has a huge filesize just for a pointer" not "how to compress something to a small size". Then making computers randomly (or even systematically, which doesn't help much) put together data and using the identifier to know when you succeeded. It's a washout.

Imagine the following scenario.

The "data file" you're trying to recreate is "Hamlet" by William Shakespeare. You could just take your hashes of the data, and make an assembler throw letters together and take a hash until it worked. Guess how astronomically long that would take? So, we generate a dictionary since you know it will be "English" that the data file creates (this is already a cheat, since you can't have such a high fidelity dictionary for data files of thousands of types and formats, but let's just use it as an example). So now, we're only combining words randomly until it works. Still going to take essentially an eternity. That's assuming to begin with that your dictionary actually has every word Shakespeare uses. How does the computer know? If you're missing even one word in your dictionary, it has to go back to brute-force for _every bit of data_ because it doesn't know if _any_ of the words in it's dictionary work. It tried them all, and unless you got the whole file, you got nothing. Try to write a program that brute-forces a 16 character password that includes numbers, and tell me how long it takes. The largest 16 character password would fit in a 32 _bit_ data file.

Now, you could take hashes of every 100KB of data, string them into a file, and bruteforce it from there. That's not really better, but okay. Have your "compression tool" generate the dictionary for that file. Somehow have that dictionary also be under 200bytes. Have the dictionary be part of the archived data.

I'm trying to tell you, do some research into current archiving processes so you understand the theory behind it. Then, do some research about complex mathematics so you can figure out how to get the computer to do these things.

I hate to be rude, but you have no idea what you're talking about with this idea of yours. It's neat, it really is. But you're thinking small-time. If your name is Robert, and we "compress" it by taking out all the vowels, we can have a dictionary that re-fills the word with vowels based on a dictionary, and it won't take much time. Easy, right?

Well how about the name "Dave"? When you "decompress" the string "DV" there are myriad possibilities. How does the computer know which one is correct? Hash, sure. Now generate software that will interpret "ghtvrtgddmnlt" back into plain-text with zero or near-zero chance of loss of data integrity. You see?

Hopefully you're starting to understand data fidelity and conservation of information a little better, and hopefully you're starting to understand the flaws in your methodology a little.

If you want to proof-of-concept your work, here is what you do.

Make a program that generates strings of length 16.
Make that program take several hashes of the generated string.
Make that program store the hashes in some kind of data file, sequentially or something, whatever. Call it .lol or something. In the old days we'd have use ".foo"
Make that program have a fillbox where you can choose a .lol to check against. If a .lol is picked, it compares the temporary .lol from step 2-3 against the selected .lol
Make that program print "I DID IT" along with start and end times in a window when it generates a 16 character string that matches those hashes.
Run that program 1000 times or so to generate an average time and stddev, etc. Also check fidelity of data for each machine success. Successful 100% of the time? I guess it worked.

Then you'll have an average time needed. I believe that was your original question.

The coding discussed above is easy to perform. You could streamline it past that point, but you'd have a proof of concept. Use autohotkey or something. You'll have proof of concept and data to collate. You can go from there to test larger files to make sure it still works and keeps 100% fidelity. If it starts getting too slow, you can see if you can speed it up by using different algorithms or techniques. When you reach a point where diminishing returns (of either software or hardware) is preventing further improvement, you have an Alpha.

Go to it. You might learn something. Like I said, it will be relatively easy. If you are currently taking any technology classes at school or university, talk to your instructor for help and to possibly obtain credit for having done it.

At the very least, you might have some fun.

Good luck.

PS Google will help you find tutorials to script the basic stuff I outlined above. More important you may find some docs about basic programming technique and concepts. Like I keep saying, the basic outline for your idea would be easy to code. Google will also help you learn about Pseudocode, and how it can help you. Go to it!

Rydian · Mar 19, 2012

Sifjar's not the OP, just wanna' point out.

SifJar · Mar 19, 2012

Rydian said:
Sifjar's not the OP, just wanna' point out.

Exactly, and I don't argue with anything any of you are saying. In fact I stated dozens of times it is not feasible.

elenar · Mar 20, 2012

SifJar said:
Rydian said:

Sifjar's not the OP, just wanna' point out.

Click to expand...

Exactly, and I don't argue with anything any of you are saying. In fact I stated dozens of times it is not feasible.

Quoted ya because I was expounding upon your point

I have pseudocode for experimental file compression.

Resident Furvert™

Not a pirate

Resident Furvert™

Not a pirate

Well-Known Member

Resident Furvert™

Well-Known Member

Quis custodiet ipsos custodes?

Well-Known Member

Well-Known Member

Not a pirate

Resident Furvert™

Not a pirate

Resident Furvert™

Not a pirate

Well-Known Member

Resident Furvert™

Not a pirate

Well-Known Member

Similar threads

Popular threads in this forum