Experimental Compression with 50% chance to reduce a file up to 25%

Discussion in 'General Off-Topic Chat' started by SoraK05, Feb 16, 2013.

  1. SoraK05
    OP

    Member SoraK05 GBAtemp Regular

    Joined:
    Aug 24, 2007
    Messages:
    148
    Country:
    Kenya
    EDIT: SEE MY THIRD POST FOR SUMMARIZED CONCEPT


    I was up until the early morning looking at ways to potentially compress a file.
    I have a basic understanding of the Huffman coding algorithm, which looks for repetition of bytes / patterns in a file. It assigns the fewest bits to the most repetitive patterns and the most bits to the least repetitive ones, so that ideally the bulk of the file consists of the most repetitive patterns and is represented in very few bits.
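
    For reference, here is a rough textbook-style sketch (in Python) of how Huffman coding assigns code lengths, just to contrast it with the frame idea described below:

    Code:
    # Rough sketch of Huffman code-length assignment: frequent bytes get short
    # codes, rare bytes get long codes (this is not the scheme in this thread).
    import heapq
    from collections import Counter

    def huffman_code_lengths(data):
        """Return {byte value: code length in bits} for the bytes in data."""
        freq = Counter(data)
        if len(freq) == 1:                       # degenerate case: one symbol
            return {next(iter(freq)): 1}
        # heap entries: (total frequency, tie-break, {symbol: depth so far})
        heap = [(f, i, {sym: 0}) for i, (sym, f) in enumerate(freq.items())]
        heapq.heapify(heap)
        tiebreak = len(heap)
        while len(heap) > 1:
            f1, _, a = heapq.heappop(heap)       # two least frequent subtrees
            f2, _, b = heapq.heappop(heap)
            merged = {s: d + 1 for s, d in {**a, **b}.items()}  # one level deeper
            heapq.heappush(heap, (f1 + f2, tiebreak, merged))
            tiebreak += 1
        return heap[0][2]

    print(huffman_code_lengths(b"aaaaaaabbbccd"))
    # the most frequent byte ('a') ends up with the shortest code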

    What I came up with is slightly different. It is still based on repetition to a degree, but it looks for something other than what the Huffman algorithm looks for.

    The idea is that there is a 50% chance that a file can be reduced by up to 25% of its original size.
    It works on frames, each consisting of a header and data, and it should be possible to run a tweaked Huffman coding pass on the data portions of the frames to try to reduce the file size even more.
    On top of that, since there is a 50% chance of a reduction of up to 25% each time, the compressor can be run again on its own output to reduce the size even further.

    This idea should generally handle 'random garbage files' better than the Huffman coding algorithm can, i.e. random data files that will not compress well with something like WinRAR should still have a 50% chance of being reduced by up to 25%.

    I am not good at programming, but I believe the concept is solid.

    I am wondering if someone would be able to help me produce the program, or at least look at my concept and offer constructive criticism.
    If anyone is interested in helping out with this idea, I can post up my concept.
     
  2. Rydian

    Member Rydian Resident Furvertâ„¢

    Joined:
    Feb 4, 2010
    Messages:
    27,883
    Location:
    Cave Entrance, Watching Cyan Write Letters
    Country:
    United States
    If you're getting ideas from video encoding, most video formats are lossy, meaning they destroy data and the decompressed version is missing data the original version had (this is how they can reduce so much).
     
  3. SoraK05
    OP

    Member SoraK05 GBAtemp Regular

    Joined:
    Aug 24, 2007
    Messages:
    148
    Country:
    Kenya
    EDIT: SEE MY THIRD POST FOR SUMMARIZED CONCEPT


    This is basically my idea; I will attach some files.
    The 'marked position repetition FINAL.txt' file is the concept, and 'HEX values for pos_comb.txt' lists the different HEX values referred to by the concept.

    I am currently experimenting with taking this somewhat further by building all possible combinations of 8 HEX values, to see whether such a database can make this more effective than the 8 varieties of HEX value combinations I wrote out.

    The idea is that if the bytes very often follow a pattern based on one of the 8 Mark 0 / Mark 1 options shown in 'HEX values for pos_comb.txt', it should be effective.

    It is basically similar to Huffman coding, but it generally encodes every byte in the equivalent of 6 bits, provided the byte matches a certain sequence identified by a 1-2 byte header per frame (the header switches between and identifies the sequences). Unlike Huffman, where some bytes can take 3 bits and others 15+ bits, this makes every byte 6 bits across the board.
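
    To make that concrete, here is a rough Python sketch of one possible reading of the frame idea; the padding rule and exact header layout are guesses for illustration and are not taken from the attached files:

    Code:
    # Sketch of one reading of the frame idea (header layout and padding rule
    # are illustrative guesses): if every nibble in a frame comes from a set of
    # 8 hex values, each nibble fits in 3 bits, so each byte fits in 6 bits.
    from itertools import combinations

    # All C(16,8) = 12870 possible sets of 8 hex values, in a fixed order shared
    # by compressor and decompressor, so a set can be named by a ~2-byte index.
    NIBBLE_SETS = list(combinations(range(16), 8))
    SET_INDEX = {s: i for i, s in enumerate(NIBBLE_SETS)}

    def pack_frame(data):
        """Return (set index, byte count, list of 3-bit nibble codes), or None
        if the frame's nibbles span more than 8 distinct hex values."""
        nibbles = [n for b in data for n in (b >> 4, b & 0xF)]
        used = sorted(set(nibbles))
        if len(used) > 8:
            return None                   # can't be coded at 6 bits per byte
        spare = [v for v in range(16) if v not in used]
        chosen = tuple(sorted(used + spare[:8 - len(used)]))  # pad set to 8 values
        code = {v: i for i, v in enumerate(chosen)}           # nibble -> 3-bit code
        return SET_INDEX[chosen], len(data), [code[n] for n in nibbles]

    frame = bytes([0x12, 0x34, 0x21, 0x43])      # every nibble is in {1,2,3,4}
    print(pack_frame(frame))                     # 4 bytes -> eight 3-bit codes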
     

    Attached Files:

  4. qwertymodo

    Member qwertymodo GBAtemp Advanced Fan

    Joined:
    Feb 1, 2010
    Messages:
    769
    Country:
    United States
    Yeah, good luck trying to improve on current lossless compression algorithms without a serious math background. Not trying to be rude or anything, but it's a problem that huge numbers of insanely smart people have studied for decades. There is far more room for improvement in lossy compression, because that is all about compromise, and there's a lot more room for creativity in tricking human senses into not noticing the missing data than there is in lossless compression. Advancements in lossless compression at this point come at the cost of pretty insane math...
     
  5. SoraK05
    OP

    Member SoraK05 GBAtemp Regular

    Joined:
    Aug 24, 2007
    Messages:
    148
    Country:
    Kenya
    In a nutshell, and after being quite happy with it, this is the final result of the concept.

     
  6. Nebuleon

    Member Nebuleon MAH BOI/GURL

    Joined:
    Dec 22, 2012
    Messages:
    900
    Country:
    Canada
    Because it's certainly not text. Maybe you'd save a bit on the 6's and 7's of lower-case ASCII text, but a lot of resets are needed in such circumstances. The switches from bold to italic in the text above mark where header switches would be needed, so you'd output 7 "headers" for the above 70 bytes, while representing the 70 bytes themselves (560 bits) in 52.5 bytes (420 bits). Headers are 3 bytes (see the explanation below to validate this), so 7 × 3 = 21 header bytes plus 52.5 data bytes: my 70 bytes "compressed" into 73.5 bytes!

    There is a need to represent the 8 hex digits chosen from [0-9A-F] in the header. "From 16, choose 8" yields 12870 possibilities, which need 14 bits (0 to 16383) to represent. Your compressor and decompressor would need a table with these:

    01234567 = 0
    01234568 = 1
    01234578 = 2
    ...
    89ABCDEF = 12869

    And that leaves 1 byte for the "compressed data after this header" length. One byte can describe up to 255 following compressed bytes, so a 3-byte header describes 255 input bytes stored in 191.25 bytes (255 bytes at 6 bits each). That is 194.25 bytes out for every 255 bytes in, a maximum compression of roughly 23.8%.
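
    Those numbers are easy to double-check (assuming a 14-bit set index plus a 1-byte length, rounded up to 3 whole header bytes per 255-byte block), for example in Python:

    Code:
    # Double-checking the header math above.
    from math import comb, ceil, log2

    n_sets = comb(16, 8)                        # 12870 ways to pick 8 of 16
    index_bits = ceil(log2(n_sets))             # 14 bits to name one of them
    header_bytes = ceil((index_bits + 8) / 8)   # plus 1 length byte -> 3 bytes

    block = 255                                 # longest run one length byte covers
    out_bytes = header_bytes + block * 6 / 8    # 3 + 191.25 = 194.25 bytes
    print(n_sets, index_bits, header_bytes, out_bytes)
    print(round(100 * (1 - out_bytes / block), 2))   # ~23.82% best-case saving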

    The need to store the chosen hex digits in the header as roughly 2 bytes would also negate any attempt to compress an already-compressed file, because those bytes have no relation to the hex digits of the data around them. So only the absolute worst-case files, like <FF FF FF FF FF FF FF FF ...>, would recompress. The first pass would turn the data into zeroes plus some headers. The second pass would then have to include the header bytes among the hex digits used: say the first pass turned <FF FF FF FF FF FF FF FF ...> into [Index 12869] <32 45> [Len 255] <FF> <00 00 00 00 00 ...>, then the second pass would have to cover 0, 2, 3, 4 and 5 as well. A later pass would need to reset in the middle of the previously recompressed headers, because by then they use more than 8 distinct hex digits.

    I'd say your algorithm could do better on its worst-case files than Huffman, but Huffman will be much better on all other files.
     
  7. Blood Fetish

    Member Blood Fetish Quis custodiet ipsos custodes?

    Joined:
    Nov 3, 2002
    Messages:
    980
    Country:
    United States
    You cannot losslessly compress random data. It is mathematically impossible: by the pigeonhole principle there are more possible files of a given length than there are shorter outputs, so no scheme can shrink them all. All compression works by exploiting patterns.
     
  8. Zetta_x

    Member Zetta_x The Insane Statistician

    Joined:
    Mar 4, 2010
    Messages:
    1,844
    Country:
    United States
    I'm in a statistics program, and every week we get a statistics seminar from people in different professions on how they utilize statistics (or develop some cool theoretical statistics). One quarter covered new compression methods and the error rate associated with them. In short, they take the picture and see if they can derive a sufficient set of data (a subset of the data that can be used to regenerate the whole picture), then a mathematical function that maps the l x w matrix representing the data into a lower dimension (in other words, compression). The better the sufficient set of data, the lower the dimension of the new matrix can be. The new matrix is one-to-one and carries bits of the sufficient data along with the structure of the old picture, so you can reverse back to the original picture. If you compress an already-compressed picture, re-compressing it does nothing, because the newly compressed picture is essentially random noise (i.e. it is minimally sufficient).

    If you look up 42.zip and zip bombs, you get a pretty good idea of how awesome compression algorithms can be.

    The only way to get better compression at that point is to sacrifice a good error rate. Let's say X1 and X2 are two values in the new compressed picture. Because the compressed picture is sufficient, X1 and X2 are most likely distinct. However, if they fall within a certain threshold of each other, you can classify both X1 and X2 as X*. Since you have collapsed the two down into one, you get better compression; however, when you reverse back to the regular image, it will be hard to recover X1 and X2 exactly.
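
    A toy Python sketch of that trade-off (a made-up example, not the method from the seminar): snapping nearby values to a shared representative makes the data far more compressible, but the exact originals can no longer be recovered.

    Code:
    # Toy illustration: quantizing values within a threshold (X1, X2 -> X*)
    # improves compression at the cost of exact reversibility.
    import random, zlib

    def quantize(values, threshold):
        """Snap each value to the nearest multiple of threshold."""
        return [round(v / threshold) * threshold for v in values]

    random.seed(0)
    data = [round(20 + random.random(), 3) for _ in range(1000)]  # noisy readings
    lossy = quantize(data, 0.25)      # nearby readings collapse to shared values

    raw, q = repr(data).encode(), repr(lossy).encode()
    print(len(zlib.compress(raw)), len(zlib.compress(q)))  # lossy is far smaller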
     
