I have pseudocode for experimental file compression.

Discussion in 'General Off-Topic Chat' started by SoraK05, Mar 18, 2012.

  1. SoraK05
    OP

    Member SoraK05 GBAtemp Regular

    Joined:
    Aug 24, 2007
    Messages:
    148
    Country:
    Kenya
    I've been thinking about file compression and came up with my own method.

    My method gives the computer all the information it needs to reconstruct the rest on its own. Any file at all can be converted to about 200 bytes. To get back to the original file, the computer has to work through a range of combinations that is measurable per file; by measuring CPU instructions, a timer for the maximum time needed to go through every combination in that file's range would give an idea of how long to wait.

    I don't know how long it will take to decompress - to get the original file back from the 200-byte file. I need some help making the program so I can experiment with different file sizes and test splitting the file into chunks during the compression stage (about 150 bytes extra per chunk), in case smaller chunks take less time in general.

    The only way I can really know is if I can get a proof-of-concept - even without the timer coded into it (although it would really help when experimenting with big files).


    Can someone help me code this?
    It basically requires some knowledge of a computer's file sequence and of hashes, plus a way to measure the maximum time it would take to work through every combination in the range, in case your file happens to be the last combination in that file's range.


    This is guaranteed to produce a very small file, but I don't know whether it is worth it in terms of how long it will take - the only way I can get a definitive answer is if someone helps me make this program.

    If you can help me out, I would really appreciate it.
    I can describe how the entire thing works in full detail if anyone is interested in helping to make it for experimentation.
     


  2. Ace

    Member Ace GBATemp's Patrick Bateman

    Joined:
    Apr 8, 2009
    Messages:
    1,035
    Location:
    Manhattan
    Country:
    Sweden
    I dunno how many advanced coders there could be here.... Maybe try your hand over at Reddit? I think more program/game devs hang around there :lol:
    It sounds like a rather interesting concept. I'd sure be up for 200 byte compression, provided it doesn't destroy my files.
     
  3. Rydian

    Member Rydian Resident Furvertâ„¢

    Joined:
    Feb 4, 2010
    Messages:
    27,883
    Location:
    Cave Entrance, Watching Cyan Write Letters
    Country:
    United States
    If it can compress files to about 200 bytes, that means it can only compress as many distinct files as can be represented in 200 bytes. That means it can only compress files up to a certain size/complexity, otherwise you'll run into collisions and then shit's all fucked and the compression's totally useless because it'll be destroying data.
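    To put rough numbers on that counting argument, a quick Python sketch (the 1 KB input size here is just an arbitrary example):

    Code:
        # Pigeonhole argument: a 200-byte output can only distinguish 256**200
        # different files, but there are far more possible 1 KB inputs, so a
        # lossless "compress anything to 200 bytes" scheme is impossible.
        outputs = 256 ** 200           # distinct 200-byte compressed files
        inputs_1kb = 256 ** 1024       # distinct 1 KB original files
        print(len(str(outputs)))       # 482 digits
        print(len(str(inputs_1kb)))    # 2467 digits -> vastly more inputs than outputs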
     
    1 person likes this.
  4. Psionic Roshambo

    Member Psionic Roshambo GBAtemp Advanced Maniac

    Joined:
    Aug 12, 2011
    Messages:
    1,707
    Country:
    United States
    I have often wondered why compression programs don't use modules - essentially large chunks of code, pre-compressed. Take the 10 GBs of data from Wii games, for example, where the compression routine could call on a pre-existing chunk of data. Of course this assumes that at least bits of code are reused across Wii games... Not sure it would work at all, but hey, it's early and I haven't had my coffee yet.

    Edit: To the OP: sorry, but this is probably the 20th time I have seen someone say "I have this new compression routine...". 200 bytes might get you a 2KB file IF everything was optimal, unless it's a dummy file. Anything useful is going to be at best 10:1... Most programs tend to hit 2:1 as a realistic ratio, depending on what's being compressed.
     
  5. SoraK05
    OP

    Member SoraK05 GBAtemp Regular

    Joined:
    Aug 24, 2007
    Messages:
    148
    Country:
    Kenya
    It is guaranteed to work so long as the computer does the brute forcing to 'get the missing information', which could take quite a while. I need to test it - if it takes more than a few days for large files, or takes silly long, then it might not be worth it. However, the created file will always be about 200b, and the file rebuilt from it is guaranteed to be 1:1 - no loss in any way.

    I've been working on many angles for a while and have come up with one that is guaranteed to work - the only question is whether it is worth it time-wise for the program to do the brute-force task (within a range calculated per file).
     
  6. Jamstruth

    Member Jamstruth Secondary Feline Anthropomorph

    Joined:
    Apr 23, 2009
    Messages:
    3,456
    Location:
    North East Scotland
    Country:
    United Kingdom
    Your "pseudocode" isn't an algorithm to do this. Its more of a vague description of what you want it to do.
    You mention brute forcing to get "missing information". So you'll be destroying large chunks of bits and bruteforcing to fill in the gap. How will the program know when it's got the right data? Brute forcing is never a good way of solving a problem. The amount of time needed increases exponentially for larger and larger bit combinations.

    If you have an algorithm then can you post it for scrutiny?
     
  7. Peps

    Member Peps GBAtemp Regular

    Joined:
    Jun 28, 2009
    Messages:
    240
    Country:
    Ireland
    If I'm understanding correctly:

    "Compressed" file contains a SHA1 (or similar) hash of the original file. It also contains additional metadata such as original file length, first few bytes of the original file, and other properties to avoid a hash collision. Your algorithm will dynamically create a new file, and for every byte change, it will hash the new file, and compare it with the original hash. If it matches then the odds are that's the original file.

    It does theoretically have a basis, but the two flaws are that it would take ages to recreate the original file, and I don't think hash collisions are that easily preventable. While the additional metadata would help tremendously in preventing a collision, it's still possible for one to happen.
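    If that reading is right, the decompressor would be something like this rough Python sketch (not the OP's actual code - the hash choice and the all-zeroes starting point are just assumptions for illustration):

    Code:
        import hashlib

        def rebuild(target_sha1_hex, length):
            """Try every file of the given length until one hashes to the stored
            SHA-1. There are 256**length candidates, so this is hopeless for
            anything beyond a handful of bytes."""
            candidate = bytearray(length)              # start from all zero bytes
            while True:
                if hashlib.sha1(candidate).hexdigest() == target_sha1_hex:
                    return bytes(candidate)            # likely the original, barring a collision
                # treat the bytes as one big base-256 counter and increment it
                for i in reversed(range(length)):
                    candidate[i] = (candidate[i] + 1) % 256
                    if candidate[i] != 0:
                        break
                else:
                    raise ValueError("exhausted every combination of this length")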
     
  8. celcodioc

    Member celcodioc Major A$$hole

    Joined:
    Nov 13, 2011
    Messages:
    278
    Country:
    Sweden
    FYI, there are 1 099 511 627 776 possible unique byte combinations in a 5 byte file. (256 ^ 5)

    A random internet calculator tells me that there are approximately this many possible byte combinations in a 2KB file:
    Warning: Spoilers inside!

    So brute forcing a 2KB file would take... well, a lot of time.
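    (For anyone who wants to check those figures, two lines of Python reproduce them - the 2KB case is shown as a digit count because the full number is enormous:)

    Code:
        print(256 ** 5)               # 1099511627776 unique 5-byte files
        print(len(str(256 ** 2048)))  # 4933 digits: the number of possible 2KB (2048-byte) files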



    EDIT: Also, I'm tired. Feel free to call me an idiot.
     
  9. SoraK05
    OP

    Member SoraK05 GBAtemp Regular

    Joined:
    Aug 24, 2007
    Messages:
    148
    Country:
    Kenya
    A hash like md5 has 256^16 combinations, as it's 16 bytes. Every 256^16 combinations, the cycle repeats again. This is the same for other hashes.

    I am working with the 'file sequence' of a file. I was experimenting with a program that takes a file and outputs its file sequence number to a txt file (in decimal).
    A 7b file has a 9b file sequence number, and it gets larger and larger as the file size increases.

    What I would do is get the MD5, SHA1 and CRC of the original file and of the file sequence number and save them down. I would then find the largest power of 256 that fits into the file sequence number.
    For a 35b file, the example was 256^37 - that means the file sequence number lies between 256^37 and 256^38. I would record 256^37 as the offset to start the process from.

    Once all the hashes are saved for the original file and the file sequence, and the largest divisible number (like 256^37 in the example) is recorded, that is all the information that needs to be saved - less than 200b.

    From that information, the search would start from the 256^X offset given and go through every combination between 256^37 and 256^38 in the example.
    For every combination, it checks the candidate file sequence number against the first byte of the saved hash - if that matches, it checks the next. It repeats the cycle until not one but all 3 hashes match the file sequence number being tested. When that happens, it builds the file from the brute-force-discovered file sequence number and checks its hashes against the original file's. If they match, it is the original file. If not, it keeps going.
    Hashes will repeat every hash cycle range, but all 3 hashes have to match, and the file created from the discovered file sequence number must match as well.

    This system guarantees that, with enough brute forcing within a range (like 256^37 - 256^38), your file will eventually be discovered and rebuilt as a 1:1 match of the original - and all of it is saved as less than 200b.


    The only thing left is knowing how long it will take; the computer can give you an estimate, based on its current calculation speed, of how long the maximum range for that particular file would take - the larger the file, the larger the range.

    If it is beneficial to use smaller chunks, this system can be used to split the original file sequence number into chunks and save the created file hashes and file sequence chunk hashes, and just run the process with these chunks.

    This method is guaranteed to recover the file eventually, within a given range, and the output will be about 200b. The only questions are how long it will take for files in general, and whether it is worth using chunks in the first place. I don't know how long it will take, but I know the method works - I will only know whether it is worth it time-wise once the program can be made and tested on different files.
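    Put as a rough Python sketch (a simplification, not the exact program - it treats the file sequence number as a plain big-endian integer, and the hash/CRC choices are just stand-ins), the decompression loop would look something like this:

    Code:
        import hashlib, zlib

        def decompress(md5_orig, sha1_orig, crc_orig, power):
            """Walk every value between 256**power and 256**(power + 1), rebuild a
            candidate file from it, and test the candidate against the saved hashes."""
            num_bytes = power + 1                      # values in this range need power+1 bytes
            for n in range(256 ** power, 256 ** (power + 1)):
                data = n.to_bytes(num_bytes, "big")    # build the candidate file
                if (zlib.crc32(data) == crc_orig                      # cheapest check first
                        and hashlib.md5(data).digest() == md5_orig
                        and hashlib.sha1(data).digest() == sha1_orig):
                    return data                        # all three match -> take it as the original
            return None

    Even written like this, the loop has to run up to 255 * 256**power times in the worst case, which is where the time estimates later in the thread come from.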
     
  10. ThatDudeWithTheFood

    Member ThatDudeWithTheFood TRIANGLEZ

    Joined:
    Mar 9, 2009
    Messages:
    2,198
    Location:
    Illuminati
    Country:
    United States
    That is for only one file, but when you're compressing things it's about size and also neatness. How long would it take to do that with multiple files?
     
  11. SoraK05
    OP

    Member SoraK05 GBAtemp Regular

    Joined:
    Aug 24, 2007
    Messages:
    148
    Country:
    Kenya
    It's meant to be done for one file at a time. You could do it to a RAR containing multiple files, for example. :)
     
  12. ThatDudeWithTheFood

    Member ThatDudeWithTheFood TRIANGLEZ

    Joined:
    Mar 9, 2009
    Messages:
    2,198
    Location:
    Illuminati
    Country:
    United States
    Yes, but wouldn't that defeat the purpose?
    Once you RAR a file you have a large compressed .rar file, which you would then have to unrar.
    The point of compression is to compress files efficiently, and having to decompress twice is very inefficient.
     
  13. SoraK05
    OP

    Member SoraK05 GBAtemp Regular

    Joined:
    Aug 24, 2007
    Messages:
    148
    Country:
    Kenya
    It basically turns any file into about 200b. 4GB, 8GB, 50GB - it doesn't matter; generally speaking, the output will be 200b. So if you apply it to a RAR containing many files, it can work that way. It only handles one file at a time - it will take long enough as it is, and multiple files would be insane.
    It works best with a single file, so use an ISO, a RAR with stuff in it, etc.
     
  14. ThatDudeWithTheFood

    Member ThatDudeWithTheFood TRIANGLEZ

    Joined:
    Mar 9, 2009
    Messages:
    2,198
    Location:
    Illuminati
    Country:
    United States
    Yes, but the bigger the RAR, the more pointless it is.
    If you have a 4GB RAR file and you shrink it to 200b, once you expand it you still have to decompress the RAR too.
     
  15. SifJar

    Member SifJar Not a pirate

    Joined:
    Apr 4, 2009
    Messages:
    6,022
    Country:
    United Kingdom
    But it saves a lot of downloading if you're downloading a file. If such an algorithm was implemented within an existing archive program, it could automatically extract the contents of the RAR or whatever after decompressing from the 200b. Or you could use ZIP instead of RAR and then at least Windows users (dunno about Mac, Linux probably depends on distro) wouldn't have to extract the ZIP, because Windows can treat ZIPs as a regular folder as far as the user is concerned.

    I think I understand what you are talking about for the most part, and, to me at least, it seems like a decent enough idea. Sadly, I don't know enough programming to be able to help. But I do have a few questions/suggestions. First off, what is the file sequence number? Second, would it not help to record the size of the original file, as well as the hashes of the file + sequence number? So that the program knows how long to make the file to rebuild? I figure that should cut down on the possibilities.

    But I think it will still take a huge amount of time, unless you were to split the original file into smaller segments. There is an insanely huge number of possible 4GB (for example) files. Trying each one until you reach the right one will take ages. It'll probably take a few seconds/minutes to generate each attempt with that many bytes, and with the huge number of attempts required it could potentially take days (EDIT: billions of years) to create the right file. (This is a guess, but I will perhaps try to throw together a test to work out roughly how long it takes to generate a file, which would let me calculate a rough figure.)

    EDIT: I used the following AutoHotKey script (this won't be as optimised as a C program, but it gives a rough idea) to generate a 100KB file (all bytes "ff" for the record):

    Warning: Spoilers inside!
    It took 446 seconds. Bear in mind that there are 256^102400 (5.921720884 x 10 ^ 246603 according to http://www.alpertron....ar/BIGCALC.HTM) possibilities for this 100KB file. Each attempt would take somewhere around that time. One day is 86400 seconds. That is enough for 193 combinations. Round that up to 200 and you're still not even remotely close to the number of possible files. It would still take more than 2.960860442 x 10 ^ 246601 days, which works out to roughly 8.1 x 10 ^ 246598 years. Doesn't seem worth it for a 100 KB file. And it gets exponentially worse as file size increases, because the number of combinations is 256 to the power of the number of bytes. (Or, if you're mathematical, the log of the number of combinations is directly proportional to the number of bytes in the file.) So for a 50GB file like you mentioned above, it would take considerably longer than this, not "just" 500 times longer.

    Note also that this does not take into account any actual generation of bytes. It is hard coded to use FF for every single byte.
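    (The AutoHotKey script itself is in the spoiler above; a rough Python equivalent of the same experiment - write 102,400 hard-coded 0xFF bytes and time it - would be:)

    Code:
        import time

        start = time.time()
        with open("test.bin", "wb") as f:
            for _ in range(102400):       # 100KB, one hard-coded 0xff byte at a time
                f.write(b"\xff")
        print(time.time() - start, "seconds to generate one candidate file")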

    In short, sadly, I would say your concept is not feasible. I wish it were (I have wondered about the possibility of combining different hashes to create an extremely compressed file myself, but never really worked out the logistics of how it would actually work), but it's not. Sorry to be the bearer of bad news.

    Perhaps when we have quantum computers.
     
    1 person likes this.
  16. smealum

    Member smealum growing up sucks.

    Joined:
    May 1, 2006
    Messages:
    627
    Location:
    SF
    Country:
    United States
    You're being way too optimistic. I think Rydian's already been over this in another thread, but let's see anyway. 4GB is about 4*8*1024*1024*1024 bits (I guess technically that would be 4GiB but that's beside the point). That means that there are 2^(4*8*1024*1024*1024) possible 4GB files. According to Wolfram, that's about 10^(10^10); for those who don't know, that's a 1 with 10 billion 0s following it. So yeah, it's a lot. So much in fact that even if generating and testing a file were to take, say, only 1 second (which given real world read/write rates is impossibly optimistic), it would still take much longer than the amount of time the universe has been around to test all possibilities.
    I don't think SoraK05's algorithm is purely brute force, though. I didn't really understand his explanation, so I don't really know. But at one point he speaks of only checking files within a certain range? So for the sake of argument, let's take a look at this "256^37 - 256^38" range. According to Google, 256^38 - 256^37 is roughly 10^91. That's a considerably lower number than before, but it's still pretty big. Now let's say checking whether each "file sequence number" is correct takes just 1ms; in the worst case it would take about (10^91)ms to find the file. Unfortunately, that's still about 10^80 years - which, again, is considerably longer than the age of the universe (about 10^10 years).
    So again, I didn't really understand what he meant, so my calculations could be irrelevant. Despite this though, I feel pretty confident when I say that there's no way this could be viable.
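    (The logarithms behind those figures are easy to check in Python; the 1ms-per-test figure is the same assumption as above:)

    Code:
        import math
        bits = 4 * 8 * 1024**3                # bits in a 4 GiB file
        print(bits * math.log10(2))           # ~1.03e10, so 2**bits is about 10**(10**10) possible files
        print(math.log10(256**38 - 256**37))  # ~91.5, i.e. roughly 10**91 values in that range
        print(10**91 / 1000 / 86400 / 365)    # at 1ms per check: ~3e80 years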

    Edit: oh come on SifJar, don't ninja me like that.
     
  17. SoraK05
    OP

    Member SoraK05 GBAtemp Regular

    Joined:
    Aug 24, 2007
    Messages:
    148
    Country:
    Kenya
    Each file in a computer can be translated to its file sequence number.

    If a file is FF, it means it is the 256th combination in the file sequence. FF FF = 65535.. It is the '65535th file' in the file sequence..

    Here is an example.
    A 35b text file containing this:
    "This is a test example of 35 bytes."
    OR (as hex)
    5468697320697320612074657374206578616D706C65206F662033352062797465732E

    Becomes this (as decimal):
    193628782805444344641702109281169662577918862165681864394092089254307819541120320003044197
    OR (as hex)
    1855787F2B1971FDB75E33697C178BE0AC0B68ED8B6248AF3BD088267128CDE82204B52F365

    That is the file sequence number of the 35b file.
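    (A simple way to picture this in Python is to read the file's bytes as one big base-256 integer; note that this plain encoding is only an approximation of the scheme and won't reproduce the exact figures above:)

    Code:
        data = b"This is a test example of 35 bytes."
        n = int.from_bytes(data, "big")   # the file's bytes read as one big base-256 number
        print(n)                          # an 84-digit integer for this 35-byte file
        print(hex(n))                     # 0x5468697320... (the hex of the file itself)

        # and back again: the integer plus the length is enough to rebuild the file
        assert n.to_bytes(len(data), "big") == data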


    With a 7b file it is pointless, but for any file above 200b it will do it justice and bring it down to ~200b.

    This is the best compression I have come up with using the file sequence number of a file as well as the original file.

    You take the offset closest to the file sequence number - i.e. 256 to the highest power that fits into it.
    In the example here, that is 256^37, as the file sequence number lies between 256^37 and 256^38.
    If the file were larger you might set it to 256^2000, say, and do the calculations between 256^2000 and 256^2001 - that is the maximum range.

    The larger the file, the larger the range - i.e. the gap from the offset to the next power is larger.
    256^37 - 256^38 is much smaller than 256^2000 - 256^2001... Either way, it works from the offset you give it and processes until it reaches the next power.

    The only problem is that doing the calculations between 256^2000 and 256^2001 could take a very long time.
    Extra bytes could be stored to bring the starting point closer to the actual value within 256^37 and 256^38, for example, but that takes a whole lot more space and makes the file almost as large as, or larger than, the original.

    The only way to keep it as small as possible is just to record the offset alone and the hashes, then let the computer brute force.

    Only issue here is time - the rest works just fine.
    In theory, let's take 256^37 and 256^38.
    If you get the difference between them, it is:
    32465260872830871999903353159024464441757777524050858994513488411585451964978178754797895680.
    If it could do 1 billion calculations in a second, it would mean this would take:
    375755334176283240739622143044264634742566869491329386510572782541498286631691 DAYS...
    hrm..
    I guess if it would take that long with a simple estimate - and that's only for the 35b example - then it's totally pointless..

    That's using 1 billion calculations per second as an example, which is a moderate figure.
    I guess I just showed myself that it would definitely not be feasible..
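    (The quick sums, for anyone who wants to reproduce them:)

    Code:
        gap = 256**38 - 256**37            # size of the search range in the example
        per_second = 10**9                 # assume 1 billion checks per second
        print(gap)                         # the 92-digit figure above
        print(gap // per_second // 86400)  # ~3.76 x 10**77 days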

    It would work, but take forever :'(

    I suppose this could only work with computers hundreds of years away, using quantum computing or light or something - if the speed were much, much faster, this might be more feasible..

    I suppose I should have done this theoretical time test myself. :) :)
     
  18. SifJar

    Member SifJar Not a pirate

    Joined:
    Apr 4, 2009
    Messages:
    6,022
    Country:
    United Kingdom
    :ninja:

    And yeah, I don't fully understand what he means either. Maybe it isn't a straight brute force, in which case my post isn't entirely applicable, but it is still sort of applicable I reckon. I don't understand what he means by "file sequence number", and google doesn't help much on that either.

    EDIT: OK, bit of an explanation of file sequence number there. I'll try and do some more calculations to apply that to my previous calculations. So "file sequence number" = "sum of all bytes in file" then. I guess that changes things.

    EDIT: OK, so in the region of 100KB-101KB files, by my calculations, there are 1.51 x 10 ^ 246606 (256 ^ 102401 - 256 ^ 102400; is this right?) valid sequence numbers, each corresponding to a different file (well, some may correspond to the same file - multiple files could have the same sequence number [e.g. ff 05 ff would have the same sequence no. as 05 ff ff]). I don't think there is any way of knowing the length of the sequence number (unless that is also recorded), so you'd have to start from 1 and work your way up. So you'd have to try more combinations than that number, I think. If you have the correct length stored, you should be able to find the number within that many combinations, I guess. But even if you could generate and test them at a rate of 1 per second, it'd take roughly 1.7 x 10 ^ 246601 days to test the sequence numbers, before you even get onto generating and testing the file itself. Now if the sequence number is closer to 256 ^ 102400 than 256 ^ 102401, it'll take less time than that. But it could easily be 1 less than 256 ^ 102401 and therefore take the full time.
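    (Checking that count with logarithms rather than writing the whole number out:)

    Code:
        import math
        # how many values lie between 256**102400 and 256**102401
        log10_count = 102400 * math.log10(256) + math.log10(255)
        print(log10_count)    # ~246606.18, i.e. about 1.5 x 10**246606 sequence numbers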
     
  19. FAST6191

    Reporter FAST6191 Techromancer

    pip
    Joined:
    Nov 21, 2005
    Messages:
    21,716
    Country:
    United Kingdom
    You seem to be proposing a combination of a premade dictionary (think rainbow tables, but for file fragments) and file recovery (although you did not elaborate, I would look to the oversampling/parity world - http://parchive.sourceforge.net/ and the whole par2 scene, or maybe even some of the RAID recovery techniques). Each has been experimented with in the past, but the amount of processing required for what you want would make it unpleasant even on a modern machine.

    Unless you have a specific application where it works well (someone mentioned the Wii earlier - despite it being a classic example of encryption making for high-entropy files, it might do well for scrubbed code, with recovery of the optional scrubbed data; you can always generate the files needed to run the game and then spend time regenerating the padding) or want to do it as a learning/demonstration project (it would make a great concept for a university-level project), I would not really push it much further.
     
    1 person likes this.
  20. SifJar

    Member SifJar Not a pirate

    Joined:
    Apr 4, 2009
    Messages:
    6,022
    Country:
    United Kingdom
    After writing the script I did earlier, I have partially convinced myself I might be able to write a rubbish PoC in AHK (obviously it'll be very slow compared with native optimised C code or whatever), and even though it will be useless to almost everyone, I may well do so if I can find the time over the next few weeks. No promises I will ever release anything. It probably won't ever reach the stage of actually working (even ignoring the fact it would take billions of years for files over a few hundred bytes - I may also take some time tomorrow and calculate what size of file I could do in a reasonable time). Just thought I'd let you all know that.

    EDIT: Thinking about it, I don't think I could do the file sequence number stuff. Working out the number would be easy enough, but I have no idea how I would go about generating a new file with that same sequence number without generating every possible file and checking the file sequence number, at which point I could just check the hashes of the file instead (which I would have to do after the sequence number anyway). But I might do it with just the hashes of the file. And file length.
     
