If more than one input matches the hashes, then there's no way to tell whether the first generated file that matches is the correct one.
Yes, but with multiple hashes, collisions are unlikely. If one hash collides on a particular file, chances are at least one of the others won't. For example, say file A and file B have the same MD5; it is highly unlikely they will also have the same SHA-1. And a candidate would have to match ALL the hashes to be accepted as valid.
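In Python, that acceptance test might look like this (a sketch; the particular hash functions are just for illustration):

```python
import hashlib

def fingerprint(data: bytes) -> tuple:
    """Take several independent hashes of the same data."""
    return (
        hashlib.md5(data).hexdigest(),
        hashlib.sha1(data).hexdigest(),
        hashlib.sha256(data).hexdigest(),
    )

def matches(candidate: bytes, target: tuple) -> bool:
    """A candidate is accepted only if ALL the hashes match."""
    return fingerprint(candidate) == target

target = fingerprint(b"the original file contents")
print(matches(b"the original file contents", target))  # True
print(matches(b"some other data", target))             # False
```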
Accepting as a theoretical truth that you could in fact do this (it isn't possible, but fine, we'll assume it is): generating a 500 MB file full of random data takes a long time based on write speeds alone, and the likelihood that any given attempt produces the correct data is abysmal. In most cases it would take centuries, even if you could write rapidly. There's no way this process would be cost-effective in terms of storage, energy use, time, or any other feasible metric.
As a proof-of-concept alone, fine. Yes, it's true that if one million monkeys typed away at one million keyboards for one million years, they'd probably generate the data needed for your file, and yes, the method that you're suggesting could be used to easily determine when the monkeys had finally succeeded.
What you're talking about is not "how to compress something to a small size" but "how to make a perfect file-identification scheme whose identifier is huge," then having computers randomly (or even systematically, which doesn't help much) put together data and using the identifier to know when you've succeeded. It's a washout.
Imagine the following scenario.
The "data file" you're trying to recreate is "Hamlet" by William Shakespeare. You could just take your hashes of the data and make an assembler throw letters together, taking a hash each time until it worked. Guess how astronomically long that would take? So we generate a dictionary, since you know the data file will be "English" (this is already a cheat, since you can't have such a high-fidelity dictionary for data files of thousands of types and formats, but let's use it as an example). Now we're only combining words randomly until it works. That's still going to take essentially an eternity, and it assumes your dictionary actually has every word Shakespeare uses. How does the computer know? If you're missing even one word from your dictionary, it has to fall back to brute force for _every bit of data_, because it doesn't know whether _any_ of the words in its dictionary worked. It tried them all, and unless you got the whole file, you got nothing.

Try to write a program that brute-forces a 16-character password that includes numbers, and tell me how long it takes. The largest 16-character password would fit in a 16-_byte_ (128-bit) data file.
Now, you could take hashes of every 100 KB of data, string them together into a file, and brute-force it from there. That's not really better, but okay. Have your "compression tool" generate the dictionary for that file. Somehow have that dictionary also be under 200 bytes. Have the dictionary be part of the archived data.
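A sketch of that chunked approach, which also shows why the "dictionary" can't come in under 200 bytes (each SHA-256 digest alone is 32 bytes, so a 500 MB file at 100 KB per chunk needs 5,120 of them, about 160 KB of hashes):

```python
import hashlib

CHUNK_SIZE = 100 * 1024  # hash every 100 KB, as above

def chunk_hashes(data: bytes) -> list:
    """Hash the data in fixed-size chunks and return the string of
    digests. This list grows linearly with the file, so it can never
    fit in a fixed 200-byte budget."""
    return [
        hashlib.sha256(data[i:i + CHUNK_SIZE]).hexdigest()
        for i in range(0, len(data), CHUNK_SIZE)
    ]

data = bytes(350 * 1024)  # a 350 KB file of zeros, for illustration
print(len(chunk_hashes(data)))  # 4 chunks: three full, one partial
```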
I'm trying to tell you, do some research into current archiving processes so you understand the theory behind it. Then, do some research about complex mathematics so you can figure out how to get the computer to do these things.
I hate to be rude, but you have no idea what you're talking about with this idea of yours. It's neat, it really is, but you're thinking small-time. If your name is Robert and we "compress" it by taking out all the vowels, we can have a dictionary that re-fills the word with vowels, and it won't take much time. Easy, right?
Well, how about the name "Dave"? When you "decompress" the string "DV", there are myriad possibilities. How does the computer know which one is correct? Hash, sure. Now write software that will interpret "ghtvrtgddmnlt" back into plain text with zero or near-zero chance of losing data integrity. You see?
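You can see the ambiguity in a few lines of Python (the word list here is a toy; a real dictionary only makes the collision worse):

```python
def skeleton(word: str) -> str:
    """'Compress' a word by stripping out its vowels."""
    return "".join(c for c in word.lower() if c not in "aeiou")

# A toy word list, just for illustration.
words = ["dave", "dive", "dove", "diva", "video", "robert"]

# Decompressing "dv" is ambiguous: several words share that skeleton.
candidates = [w for w in words if skeleton(w) == "dv"]
print(candidates)  # ['dave', 'dive', 'dove', 'diva']
```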
Hopefully you're starting to understand data fidelity and conservation of information a little better, and hopefully you're starting to understand the flaws in your methodology a little.
If you want to proof-of-concept your work, here is what you do.
1. Make a program that generates strings of length 16.
2. Make that program take several hashes of the generated string.
3. Make that program store the hashes in some kind of data file, sequentially or something, whatever. Call it .lol or something; in the old days we'd have used ".foo".
4. Make that program have a fillbox where you can choose a .lol to check against. If a .lol is picked, it compares the temporary .lol from steps 2-3 against the selected .lol.
5. Make that program print "I DID IT", along with start and end times, in a window when it generates a 16-character string that matches those hashes.
6. Run that program 1000 times or so to get an average time, standard deviation, etc. Also check the fidelity of the data for each success. Successful 100% of the time? I guess it worked.
Then you'll have an average time needed. I believe that was your original question.
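The steps above can be sketched in a few lines of Python (a minimal sketch: the string length is cut from 16 to 3 so the brute force actually finishes, and the .lol file and window are replaced with in-memory values and a print):

```python
import hashlib
import random
import string
import time

ALPHABET = string.ascii_lowercase
LENGTH = 3  # the steps say 16, but 26**16 is ~4e22 candidates;
            # 3 characters lets the demo finish quickly

def hashes(s: str) -> tuple:
    """Step 2: take several hashes of a string."""
    b = s.encode()
    return (hashlib.md5(b).hexdigest(), hashlib.sha1(b).hexdigest())

# Steps 1 and 3: generate a string and store its hashes (our ".lol").
secret = "".join(random.choice(ALPHABET) for _ in range(LENGTH))
target = hashes(secret)

# Steps 4 and 5: guess strings until ALL the hashes match, and time it.
start = time.time()
attempts = 0
while True:
    attempts += 1
    guess = "".join(random.choice(ALPHABET) for _ in range(LENGTH))
    if hashes(guess) == target:
        break

print(f"I DID IT: {attempts} attempts in {time.time() - start:.2f}s")
```

Run it many times (step 6) to gather your average and standard deviation, then watch what happens to the running time every time you add one more character to LENGTH.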
The coding discussed above is easy to perform. You could streamline it past that point, but you'd have a proof of concept; use AutoHotkey or something. You'll have a proof of concept and data to collate. From there you can test larger files to make sure it still works and keeps 100% fidelity. If it starts getting too slow, see whether you can speed it up with different algorithms or techniques. When you reach the point where diminishing returns (in either software or hardware) prevent further improvement, you have an Alpha.
Go to it. You might learn something. Like I said, it will be relatively easy. If you are currently taking any technology classes at school or university, talk to your instructor for help and to possibly obtain credit for having done it.
At the very least, you might have some fun.
Good luck.
PS: Google will help you find tutorials to script the basic stuff I outlined above. More importantly, you may find some docs about basic programming techniques and concepts. Like I keep saying, the basic outline for your idea would be easy to code. Google will also help you learn about pseudocode and how it can help you. Go to it!