Experimental Compression with 50% chance to reduce a file up to 25%

Discussion in 'General Off-Topic Chat' started by SoraK05, Feb 16, 2013.

  1. SoraK05
    OP

    Member SoraK05 GBAtemp Regular

    Joined:
    Aug 24, 2007
    Messages:
    148
    Country:
    Kenya
    EDIT: SEE MY THIRD POST FOR SUMMARIZED CONCEPT


    I was up until the early morning looking at ways to potentially compress a file.
    I have a basic understanding of the Huffman coding algorithm, which looks for repetition of bytes / patterns in a file. It assigns the fewest bits to the most repetitive patterns and the most bits to the least repetitive ones, so that ideally the bulk of the file consists of the most repetitive patterns and is represented in very few bits.
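
    For reference, here is a rough textbook-style sketch (in Python) of how Huffman coding assigns code lengths, just to contrast it with the frame idea described below:

    Code:
    # Rough sketch of Huffman code-length assignment: frequent bytes get short
    # codes, rare bytes get long codes (this is not the scheme in this thread).
    import heapq
    from collections import Counter

    def huffman_code_lengths(data):
        """Return {byte value: code length in bits} for the bytes in data."""
        freq = Counter(data)
        if len(freq) == 1:                       # degenerate case: one symbol
            return {next(iter(freq)): 1}
        # heap entries: (total frequency, tie-break, {symbol: depth so far})
        heap = [(f, i, {sym: 0}) for i, (sym, f) in enumerate(freq.items())]
        heapq.heapify(heap)
        tiebreak = len(heap)
        while len(heap) > 1:
            f1, _, a = heapq.heappop(heap)       # two least frequent subtrees
            f2, _, b = heapq.heappop(heap)
            merged = {s: d + 1 for s, d in {**a, **b}.items()}  # one level deeper
            heapq.heappush(heap, (f1 + f2, tiebreak, merged))
            tiebreak += 1
        return heap[0][2]

    print(huffman_code_lengths(b"aaaaaaabbbccd"))
    # the most frequent byte ('a') ends up with the shortest code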

    What I came up with is slightly different. It is still based on repetition to a degree, but it looks for something other than what the Huffman algorithm looks for.

    The idea is that there is a 50% chance that a file can be reduced by up to 25% of its original size.
    It works on frames, each consisting of a header and data, and it should be possible to run a tweaked Huffman coding pass on the data portions of the frames to try to reduce the file size even more.
    On top of that, since there is a 50% chance of a reduction of up to 25% each time, the compressor can be run again on its own output to reduce the size even further.

    This idea should generally handle 'random garbage files' better than the Huffman coding algorithm can, i.e. random data files that will not compress well with something like WinRAR should still have a 50% chance of being reduced by up to 25%.

    I am not good at programming, but I believe the concept is solid.

    I am wondering if someone would be able to help me produce the program, or at least look at my concept and offer constructive criticism.
    If anyone is interested in helping out with this idea, I can post up my concept.
     
  2. Rydian

    Member Rydian Resident Furvertâ„¢

    Joined:
    Feb 4, 2010
    Messages:
    27,883
    Location:
    Cave Entrance, Watching Cyan Write Letters
    Country:
    United States
    If you're getting ideas from video encoding, most video formats are lossy, meaning they destroy data and the decompressed version is missing data the original version had (this is how they can reduce so much).
     
  3. SoraK05
    OP

    Member SoraK05 GBAtemp Regular

    Joined:
    Aug 24, 2007
    Messages:
    148
    Country:
    Kenya
    EDIT: SEE MY THIRD POST FOR SUMMARIZED CONCEPT


    This is basically my idea; I will attach some files.
    The 'marked position repetition FINAL.txt' file is the concept, and 'HEX values for pos_comb.txt' lists the different HEX values referred to by the concept.

    I am currently experimenting with taking this somewhat further by building all possible combinations of 8 HEX values, to see whether such a database can make this more effective than the 8 varieties of HEX value combinations I wrote out.

    The idea is that if the bytes very often follow a pattern based on one of the 8 Mark 0 / Mark 1 options shown in 'HEX values for pos_comb.txt', it should be effective.

    It is basically similar to Huffman coding, but it generally encodes every byte in the equivalent of 6 bits, provided the byte matches a certain sequence identified by a 1-2 byte header per frame (the header switches between and identifies the sequences). Unlike Huffman, where some bytes can take 3 bits and others 15+ bits, this makes every byte 6 bits across the board.
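
    To make that concrete, here is a rough Python sketch of one possible reading of the frame idea; the padding rule and exact header layout are guesses for illustration and are not taken from the attached files:

    Code:
    # Sketch of one reading of the frame idea (header layout and padding rule
    # are illustrative guesses): if every nibble in a frame comes from a set of
    # 8 hex values, each nibble fits in 3 bits, so each byte fits in 6 bits.
    from itertools import combinations

    # All C(16,8) = 12870 possible sets of 8 hex values, in a fixed order shared
    # by compressor and decompressor, so a set can be named by a ~2-byte index.
    NIBBLE_SETS = list(combinations(range(16), 8))
    SET_INDEX = {s: i for i, s in enumerate(NIBBLE_SETS)}

    def pack_frame(data):
        """Return (set index, byte count, list of 3-bit nibble codes), or None
        if the frame's nibbles span more than 8 distinct hex values."""
        nibbles = [n for b in data for n in (b >> 4, b & 0xF)]
        used = sorted(set(nibbles))
        if len(used) > 8:
            return None                   # can't be coded at 6 bits per byte
        spare = [v for v in range(16) if v not in used]
        chosen = tuple(sorted(used + spare[:8 - len(used)]))  # pad set to 8 values
        code = {v: i for i, v in enumerate(chosen)}           # nibble -> 3-bit code
        return SET_INDEX[chosen], len(data), [code[n] for n in nibbles]

    frame = bytes([0x12, 0x34, 0x21, 0x43])      # every nibble is in {1,2,3,4}
    print(pack_frame(frame))                     # 4 bytes -> eight 3-bit codes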
     

    Attached Files:

  4. qwertymodo

    Member qwertymodo GBAtemp Advanced Fan

    Joined:
    Feb 1, 2010
    Messages:
    769
    Country:
    United States
    Yeah, good luck trying to improve on current lossless compression algorithms without a serious math background. Not trying to be rude or anything, but it's a problem that huge numbers of insanely smart people have studied for decades. There is far more room for improvement in lossy compression, because that is all about compromise, and there's a lot more room for creativity in tricking human senses into not noticing the missing data than there is in lossless compression. Advancements in lossless compression at this point come at the cost of pretty insane math...
     
  5. SoraK05
    OP

    Member SoraK05 GBAtemp Regular

    Joined:
    Aug 24, 2007
    Messages:
    148
    Country:
    Kenya
    In a nutshell, and after being quite happy with it, this is the final result of the concept.

     
  6. Nebuleon

    Member Nebuleon MAH BOI/GURL

    Joined:
    Dec 22, 2012
    Messages:
    900
    Country:
    Canada
    Because it's certainly not text. Maybe you'd save a bit on the 6's and 7's of lower-case ASCII text, but a lot of resets are needed in such circumstances. The switches from bold to italic in the text above mark where header switches would be needed, so you'd output 7 "headers" for the above 70 bytes, while representing the 70 bytes themselves (560 bits) in 52.5 bytes (420 bits). Headers are 3 bytes (see the explanation below to validate this), so 7 × 3 = 21 header bytes plus 52.5 data bytes: my 70 bytes "compressed" into 73.5 bytes!

    There is a need to represent the 8 hex digits chosen from [0-9A-F] in the header. "From 16, choose 8" yields 12870 possibilities, which need 14 bits (0 to 16383) to represent. Your compressor and decompressor would need a table with these:

    01234567 = 0
    01234568 = 1
    01234578 = 2
    ...
    89ABCDEF = 12869

    And that leaves 1 byte for the "compressed data after this header" length. One byte can describe up to 255 following compressed bytes, so a 3-byte header describes 255 input bytes stored in 191.25 bytes (255 bytes at 6 bits each). That is 194.25 bytes out for every 255 bytes in, a maximum compression of roughly 23.8%.
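
    Those numbers are easy to double-check (assuming a 14-bit set index plus a 1-byte length, rounded up to 3 whole header bytes per 255-byte block), for example in Python:

    Code:
    # Double-checking the header math above.
    from math import comb, ceil, log2

    n_sets = comb(16, 8)                        # 12870 ways to pick 8 of 16
    index_bits = ceil(log2(n_sets))             # 14 bits to name one of them
    header_bytes = ceil((index_bits + 8) / 8)   # plus 1 length byte -> 3 bytes

    block = 255                                 # longest run one length byte covers
    out_bytes = header_bytes + block * 6 / 8    # 3 + 191.25 = 194.25 bytes
    print(n_sets, index_bits, header_bytes, out_bytes)
    print(round(100 * (1 - out_bytes / block), 2))   # ~23.82% best-case saving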

    The need to store the chosen hex digits in the header as roughly 2 bytes would also negate any attempt to compress an already-compressed file, because those bytes have no relation to the hex digits of the data around them. So only the absolute worst-case files, like <FF FF FF FF FF FF FF FF ...>, would recompress. The first pass would turn the data into zeroes plus some headers. The second pass would then have to include the header bytes among the hex digits used: say the first pass turned <FF FF FF FF FF FF FF FF ...> into [Index 12869] <32 45> [Len 255] <FF> <00 00 00 00 00 ...>, then the second pass would have to cover 0, 2, 3, 4 and 5 as well. A later pass would need to reset in the middle of the previously recompressed headers, because by then they use more than 8 distinct hex digits.

    I'd say your algorithm could do better on its worst-case files than Huffman, but Huffman will be much better on all other files.
     
  7. Blood Fetish

    Member Blood Fetish Quis custodiet ipsos custodes?

    Joined:
    Nov 3, 2002
    Messages:
    980
    Country:
    United States
    You cannot losslessly compress random data. It is mathematically impossible: by the pigeonhole principle there are more possible files of a given length than there are shorter outputs, so no scheme can shrink them all. All compression works by exploiting patterns.
     
  8. Zetta_x

    Member Zetta_x The Insane Statistician

    Joined:
    Mar 4, 2010
    Messages:
    1,844
    Country:
    United States
    I'm in a statistics program, and every week we get a statistics seminar from people in different professions on how they utilize statistics (or develop some cool theoretical statistics). One quarter covered new compression methods and the error rate associated with them. In short, they take the picture and see if they can derive a sufficient set of data (a subset of the data that can be used to regenerate the whole picture), then a mathematical function that maps the l x w matrix representing the data into a lower dimension (in other words, compression). The better the sufficient set of data, the lower the dimension of the new matrix can be. The new matrix is one-to-one and carries bits of the sufficient data along with the structure of the old picture, so you can reverse back to the original picture. If you compress an already-compressed picture, re-compressing it does nothing, because the newly compressed picture is essentially random noise (i.e. it is minimally sufficient).

    If you look up 42.zip and zip bombs, you get a pretty good idea of how awesome compression algorithms can be.

    The only way to get better compression at that point is to sacrifice a good error rate. Let's say X1 and X2 are two values in the new compressed picture. Because the compressed picture is sufficient, X1 and X2 are most likely distinct. However, if they fall within a certain threshold of each other, you can classify both X1 and X2 as X*. Since you have collapsed the two down into one, you get better compression; however, when you reverse back to the regular image, it will be hard to recover X1 and X2 exactly.
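
    A toy Python sketch of that trade-off (a made-up example, not the method from the seminar): snapping nearby values to a shared representative makes the data far more compressible, but the exact originals can no longer be recovered.

    Code:
    # Toy illustration: quantizing values within a threshold (X1, X2 -> X*)
    # improves compression at the cost of exact reversibility.
    import random, zlib

    def quantize(values, threshold):
        """Snap each value to the nearest multiple of threshold."""
        return [round(v / threshold) * threshold for v in values]

    random.seed(0)
    data = [round(20 + random.random(), 3) for _ in range(1000)]  # noisy readings
    lossy = quantize(data, 0.25)      # nearby readings collapse to shared values

    raw, q = repr(data).encode(), repr(lossy).encode()
    print(len(zlib.compress(raw)), len(zlib.compress(q)))  # lossy is far smaller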
     
