Anyone interested in helping to program a systematic compression tool?

Discussion in 'General Off-Topic Chat' started by SoraK05, May 4, 2014.

  1. SoraK05
    OP

    SoraK05 GBAtemp Regular

    Member
    155
    13
    Aug 24, 2007
    Kenya
    I've been thinking of compression for some time.

    A BMP file generally consists of a 24-bit string for each pixel.
    For a picture that uses only 10 colours, the file is therefore repetition of just 10 variations of 24-bit strings.


    I believe that, in general, any file can be compressed systematically to its 'best', without 'compression levels'.

    There are 3 general ways to detect compressible repetition in a file.
    1) Set frames
    2) In a row
    3) Throughout the file

    Set frames means taking the BMP with its 24-bit variations and, since only 10 variations are used, replacing each 24-bit string with a 4-bit index that fits all 10 variations.
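    The 'set frames' idea above can be sketched in a few lines - a minimal, illustrative version (the function names are mine, not from the writeup) that builds a value table and replaces each fixed-width frame with the smallest index that covers the distinct values used:

```python
import math

def pack_frames(values):
    """Shrink fixed-width 'frames' to the fewest bits covering the
    distinct values actually used, plus a table to map indices back.
    (A sketch of the 'set frames' MODE; names are illustrative.)"""
    table = sorted(set(values))
    index_bits = max(1, math.ceil(math.log2(len(table))))
    lookup = {v: i for i, v in enumerate(table)}
    return table, index_bits, [lookup[v] for v in values]

def unpack_frames(table, packed):
    return [table[i] for i in packed]

# 10 distinct 24-bit colours would need only 4-bit indices (2^4 >= 10);
# here 4 distinct colours fit in 2 bits each instead of 24.
pixels = [0xFF0000, 0x00FF00, 0x0000FF] * 100 + [0x102030] * 50
table, bits, packed = pack_frames(pixels)
assert bits == 2
assert unpack_frames(table, packed) == pixels
```

    The header cost is just the table itself plus the index width, which is why the post argues the approach pays off wherever few variations dominate an area.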


    In a row means looking for repetition of the same sequence, such as one colour repeated for a chunk of the file, storing its sequence, offset and repetition count, and removing it from the file completely (so it can later be restored at that offset).
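    A minimal sketch of this run-slicing idea (my own illustrative code, assuming runs are recorded against their original offsets so re-inserting them in ascending order restores the file):

```python
def rle_extract(data, min_run=16):
    """Slice long runs out of the data, recording each as
    (original offset, symbol, run length). A sketch of the
    'in a row' MODE described above."""
    runs, kept, i = [], [], 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1
        if j - i >= min_run:
            runs.append((i, data[i], j - i))
        else:
            kept.extend(data[i:j])
        i = j
    return runs, kept

def rle_restore(runs, kept):
    out = list(kept)
    # inserting in ascending original-offset order restores the prefix
    # to original coordinates before each later insertion
    for offset, sym, length in runs:
        out[offset:offset] = [sym] * length
    return out

data = [0] * 50 + [1, 0, 1] + [1] * 40 + [0, 1]
runs, kept = rle_extract(data)
assert runs[0] == (0, 0, 50)
assert rle_restore(runs, kept) == data
```

    The `min_run` threshold stands in for the post's notion of a run being 'worth recording' - short runs stay in the file for the later scans to handle.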


    Throughout the file is a general Huffman-style system which targets repetition of sequences anywhere in the file and rewrites them using a code tree of variable-length codes. The idea is that more frequent sequences use fewer bits than less frequent ones, which should allow overall compression (and the code-tree sequences cannot be confused when reading the file from left to right, since the code is prefix-free).
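    For reference, classic Huffman coding - the baseline the post's MODE 3 builds on - can be sketched like this (a standard textbook construction, not the optimized variant described later in the thread):

```python
import heapq
from collections import Counter

def huffman_codes(data):
    """Build a prefix-free code: frequent symbols get shorter bit
    strings. A sketch of the 'throughout the file' baseline."""
    freq = Counter(data)
    if len(freq) == 1:  # degenerate case: a single symbol
        return {next(iter(freq)): "0"}
    # heap entries carry a unique tiebreaker so dicts are never compared
    heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

text = "aaaaaaabbbccd"
codes = huffman_codes(text)
encoded = "".join(codes[ch] for ch in text)
# the most frequent symbol gets the shortest code
assert len(codes["a"]) == 1
assert len(encoded) < len(text) * 8
```

    Reading the encoded stream left to right, no code is a prefix of another, which is what the post means by the code tree "cannot be confused".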



    While each way of detecting compression works on its own, each will generally find compression that the others will not.

    Set frames will be able to reduce the overall file.
    In a row will be able to capture large chunks in a section (for example, it is the optimal method for a 1GB file of 0 bits).
    Throughout will catch general repetition of specific sequences and recode it.



    Throughout (like Huffman), when done alone, can leave runs of the same coded sequence if repetition in a row is prevalent in the source (like a 1GB file of 0 bits).

    To fix that, an 'in a row' scan done in advance can remove such runs where possible, leaving the throughout scan to find repetition between the gaps rather than writing a file that inherits runs from its source.


    By shrinking frames into fewer bits based on the sequences actually used, the in-a-row scan will then store shorter offsets, sequences and repetition data in general.
    Likewise, the throughout scan (like Huffman) is then done on a shrunken file of generally organized, common data, with no sequences left in a row.


    The overall effect of doing 'set frames', then 'in a row', then 'throughout' is that each scan inherits from the previous ones, resulting in a generally well-compressed file.


    The idea is that each of these methods, which are called 'MODES', will be thorough in its approach.
    Set frames will not simply look at all sequences in the file as a whole.
    It will specifically target areas of the file that can be shrunk, identified by their offset locations - a dynamic method of shrinking areas into fewer bits according to the sequence variations they contain.

    For example, a file can have a prevalence of five 24-bit sequences in multiple areas, like 0x20-0x100, 0x200-0x500, 0x1000-0x2500 etc. - it will shrink those specific areas, associate them with the same 'section', and thus dynamically shrink areas in the manner that produces the most overall compression.


    For in a row, it will only use runs long enough to complement further compression by the throughout scan that follows. Rather than treating, say, any 48 zero bits in a row as eligible for storage, it will look for the shortest version of a repeating sequence in a row that is actually 'worth recording'.

    For throughout, it will determine the sequences that will save the most bits in the file overall, blank out their offsets, and rescan in this manner until it ends up with the smallest number of the most bit-saving sequences needed to assemble the file. The code tree will be constructed not by raw frequency but by ordering the most bit-saving sequences to use the fewest bits on the code tree.
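    One selection step of this 'most bit-saving sequence' search might look as follows - a rough sketch under an assumed cost model (replacing `count` copies of an n-character sequence with a one-character token saves count*(n-1) characters, minus n characters to store the sequence itself); repeating it after replacing the chosen occurrences approximates the blank-and-rescan loop described above:

```python
def most_saving_sequence(data, min_len=2, max_len=8):
    """One greedy pass: find the substring whose non-overlapping
    occurrences would save the most characters if each were replaced
    by a single-token code. (Illustrative cost model, not the
    writeup's exact accounting.)"""
    best, best_saving = None, 0
    for n in range(min_len, max_len + 1):
        positions = {}
        for i in range(len(data) - n + 1):
            positions.setdefault(data[i:i + n], []).append(i)
        for seq, offs in positions.items():
            count, last = 0, -n  # count non-overlapping occurrences only
            for o in offs:
                if o >= last + n:
                    count, last = count + 1, o
            saving = count * (n - 1) - n
            if saving > best_saving:
                best, best_saving = seq, saving
    return best, best_saving

seq, saved = most_saving_sequence("abcabcabcxyz")
assert seq == "abc" and saved == 3
```

    Ranking candidates by bits saved rather than raw frequency is the key difference the post draws against plain Huffman.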


    As each method is considered optimized, there are no 'compression levels'.
    It is 3 optimized algorithms for detecting compression which, when run one after the other in priority order, compress the file with inheritance.

    Since each of the 3 MODES looks for repetition and compression in a different way, this generally applies to any file in which repetition can be found.


    For each section found by MODE 1 - such as the areas containing the five 24-bit variations above - all its areas are treated as the same section, and the 'in a row' and 'throughout' MODES that follow are done on each section as though it were one continuous string, adjusting offset information where required.

    The result is that 'in a row' and 'throughout' are done on shrunken sections of specific common data, which encourages more data to be found 'in a row' and lets throughout work on common data.

    This will achieve more compression than doing 'in a row' or 'throughout' on the whole file (even a shrunken one), since 'in a row' specifically targets common data that has been detected in advance and treated as continuous, and 'throughout' looks for repetition in an area already known to contain few sequences with high prevalence.


    When each MODE is done once, one after the other (and within each respective section), the result is generally streamable - it can be decoded on the fly.

    When each MODE is repeated more than once where possible, the result is like a russian doll: the file is the smallest doll plus the data needed to reconstruct all the larger ones.

    This can, for example, find more data 'in a row' on a second pass after the initial 'in a row' data has been removed - squeezing the file like an accordion to get more compression where possible using the 3 optimized, complementary MODES.


    I have a writeup (which needs some adjusting - my prediction for compression in it is too high in general), but general pseudocode is there, along with the expectation of an overall highly compressed file (without compression levels), in either a highly compressed 'streamable' form or a highly compressed 'archival' form.
    As there are no compression levels, and each MODE outputs the 'ideal' (fewest-bit) result possible, it can systematically compress files in general, where possible, by multiple methods of detecting compression.


    Additional information can be put in the header, such as resolution/bitrate, to allow the data being worked on to be raw data such as raw image or raw audio, for maximum overall compression. Minor synchronization information can also be included to allow immediate decoding from specific offsets (files) when using the streamable scheme.

    The concept is a systematic manner of compressing a file using multiple complementing and inheriting methods of detecting compression, and only using a MODE where it is detected in advance that compression will occur - scans for MODE 1 and MODE 2 are done first, and if only MODE 3 will find compression, that is the only one used.


    For this reason it is generally calculation-intensive and requires generous temporary storage - but the result is a generally optimized file which can generally be accessed immediately.


    While the writeup could still use some adjustment, along with pseudocode and an accurate systematic representation of the header and data, I'll put the writeup here.

    I have also made a post here:
    http://forum.codecall.net/topic/779...assist-in-developing-a-compression-algorithm/


    If anyone is interested in taking this concept and coding the algorithms of each MODE, letting them work systematically based on the selected SCHEME, this could be used as a general compression tool with an expectation of high results.


    For the archival scheme and its repeated passes, I don't think compression tools like rar/7z go as far as this, and a streaming file is expected to be both highly compressed (each MODE being optimized for ideal results) and accessible.

    Any other MODES to add which will complement and inherit are also welcome :)
    My programming skills are low, however a general outline and pseudocode of expected results is there.


    Thanks.
     

    Attached Files:

  2. FAST6191

    FAST6191 Techromancer

    pip Reporter
    23,190
    8,942
    Nov 21, 2005
    I recall your earlier threads ( http://gbatemp.net/threads/another-attempt-at-file-compression-s.326985 , http://gbatemp.net/threads/file-sequence-hash-compression.324474/ , http://gbatemp.net/threads/i-have-pseudocode-for-experimental-file-compression.323185/ , http://gbatemp.net/threads/would-appreciate-some-help-making-an-i-think-awesome-program.321990/ ). You are doing better here, but where the earlier stuff might have worked had you had a supercomputer, this just looks like the standard compression seen in the likes of 7zip.

    By the looks of things you are now experimenting with dictionary compression (what you have called set frames, and also an aspect of 'throughout the file'), combined with sliding window or some form of RLE (granted, RLE is considered a simplified version of sliding window, and it is quite similar to what you describe for 2).

    No levels? So you are going a bit at a time for your dictionary, possibly also doing bitwise operations... that is going to be painfully slow. If not, then adding that will increase compression - how much obviously varies, but something at least - and thus you are back at levels.
     
  3. SoraK05
    OP

    SoraK05 GBAtemp Regular

    It isn't a dictionary.
    It is getting what it can via alternate, complementary ways of detecting compression, and getting the most requires looking at specific bits.

    It will look for areas of common data that can be shrunk into fewer bits, scan for and remove sections that repeat in a row, and then do something like Huffman on the remainder - which by then is shrunken, full of common data, and free of the in-a-row repetition that would otherwise produce repetitive coded strings.

    This is not a dictionary.
    This is rewriting areas as something else. In the case of frames, it is the number of bits into which all sequences in the area fit, like 10 variations fitting in 4 bits each. In the case of 'throughout' (like Huffman), specific bit strings - chosen to capture the most compression from exact sequences - go on a code tree of varying lengths, with the shorter entries representing the sequences that save the most bits.
    The only information in the header is the sequences themselves, the mode type, offsets where relevant, and the coded versions. It takes exactly the data required and its representation.

    It is a combination of complementary ways to detect compression. Huffman alone will leave sequences that repeat in a row, and is done on generally larger data. Shrinking areas into a common 'section', removing in-a-row repetition within the section (which will find more than doing it on the whole file), and then looking for sequences throughout - on shrunken common data with no runs left, which yields more compression than without the shrinking and consolidation - should provide something generally better than the 'best' level of 7z/rar.

    Archival mode should provide something that 7z/rar don't do, since I believe they don't go 'this far' :)


    Huffman is something seen in many compression codecs (7z/rar).
    This one (MODE 3) is optimized with a temporary file of offset data to specifically target the sequences which will save the most bits (i.e. the most compressible sequences), as opposed to plain 'frequency', which is the more common criterion. It will rescan over and over, blanking out offsets to avoid clashes, to arrive at the minimal number of sequences which save the most bits, and then organize them on a code tree (prefix-free, so sequences read left to right cannot be confused), giving the most bit-saving sequences the shortest entries. It is a thoroughly optimized Huffman-type method to get the most compression possible in one pass.


    The issue with Huffman ('throughout', MODE 3) is that where there are sequences repeating in a row, it will find a decent sequence to represent them and recode it on the code tree. The result is generally larger than 'in a row' (MODE 2), which stores the offset, sequence and repetition count and slices the contents out of the file.
    For a 1GB file of 0 bits, 'in a row' (MODE 2) can store the file in a few bytes. A Huffman-type algorithm will end up much larger (as can be shown with 7z/rar in general).
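    A back-of-envelope version of this comparison (the record sizes are my illustrative assumptions, not a real container format): a single run record stays tiny, while byte-wise Huffman on a one-symbol file still emits at least one bit per input byte.

```python
# Comparison for a 1 GiB file of zero bytes
n = 2 ** 30                # 1 GiB of input

# MODE 2 ('in a row'): one record of symbol + offset + repetition
# (assumed widths: 1 symbol byte, 64-bit offset, 64-bit count)
rle_record = 1 + 8 + 8

# Byte-wise Huffman floor: even a degenerate 1-bit code for the
# single symbol emits one bit per input byte.
huffman_floor = n // 8     # 128 MiB

assert rle_record == 17
assert huffman_floor == 134217728
assert rle_record < huffman_floor
```

    Real Huffman-based tools add further layers (LZ matching, block headers), but the floor illustrates why a dedicated run mode wins on this degenerate input.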

    Also, shrinking data areas into consolidated sections is a global step toward optimizing the file overall and preparing it for something like Huffman; with 'in a row' picking up what it can, the file is prepped so the Huffman-type algorithm inherits more compression from the consolidated, shrunken common data.


    Each MODE is designed to get the most compression possible for its seeking method. The frame method shrinks the data into the fewest, most bit-saving sections of areas the file allows. The in-a-row method finds the minimal version of each repeating sequence, records its offset and repetition and slices out the contents, which is optimal in bits (for a 1GB file of 0 bits it records '0 bit', 'offset 0', 'repetition 1GB' - optimal). The throughout method uses the fewest, most bit-saving sequences (over the common data in the sections produced by the optimized section info) on a code tree designed to produce the smallest file, giving the most bit-saving sequences the shortest codes.

    Each MODE, and each way of detecting and storing compression, inherits from and encourages more compression in the MODES that follow, and is designed to give optimal results directly. There are no 'compression levels', since each MODE is optimal in its results and confirms in advance that it will compress.

    There are 'schemes'. If each MODE is done once where possible, the file can be decoded immediately - taking general data into account when decoding the MODE 3 information, then the MODE 1 information for the current section, injecting MODE 2 information where required. This is the 'streaming' scheme.
    Where each MODE is done repeatedly where possible (which I believe goes further than other compression methods), 'layers' are built, each of which must be decoded to get to the next. This is the 'archival', thorough scheme, which should generally provide a lot more compression when MODES can be repeated.

    Since the MODES are optimized for best results on each pass, the streaming scheme generally achieves a high amount of compression (with inheritance) while remaining relatively immediately accessible.


    The encoding and scanning does require temporary space, but the result is a generally highly compressed file (compared to the usual levels of rar/7z) which can decode immediately with some minor calculations.

    While I don't have an example, considering the extent to which MODE 1 prepares the file - shrunken areas consolidated into sections for MODES 2 and 3 - I believe this will go further than 7z/rar for the streaming scheme, and further still for archival.


    There's no compression 'level', only a 'scheme' determining whether the result can decode immediately or goes 'all the way' with layers which eventually decode to the file.
    There's no set dictionary size - the required information is just the offsets, section information, sequences, repetition counts and code-tree information expected from the MODES. It looks at specific bit information for the overall optimal result: the smallest file.

    Looking at the examples of each, the MODES are optimized to save the most bits they can detect :)
    The aim is to systematically compress with different methods of detection that complement and inherit, so 'levels' are not something I have in mind - each MODE does its optimal work where it can, finding compression the others won't, and complements the rest by preparing the file for a more ideal result in the next MODE.


    EDIT:
    Also, considering past attempts: the aim then was to avoid what 7z etc. do and look for a means of doing something else, without these calculations in mind.

    For a file taken as a whole - a string of 0 and 1 bits - I believe compression can generally be reduced to finding repetition of a sequence that can be stored in fewer bits.
    These are 3 ways to find such repetition, and each MODE will compress the result wherever it can, in the most optimal way the file allows.


    Beyond adding more means of detection (each compressing optimally for its means), the rest is general mathematical calculation.
    The result is a file with its detectable repetition removed once the MODES have passed.
    It should be thorough enough to possibly beat 7z/rar, with its inheritance and thorough section creation giving an overall global effect.

    Unlike the previous attempts, this is something doable, and I don't think it is what those tools do (since I can see dictionary sizes and other options there, which this doesn't care about). I also don't think they have something like the archival scheme, which can strip in-a-row repetition and, where possible, do it again after the file shuffles together with that data removed (on consolidated, shrunken data), producing a file in which each layer must be reconstructed in turn. I don't think 7z/rar go that far.


    From what I understand, those tools balance time and size (with current HDD space in mind). This aims for thoroughness and the maximum result possible from each inheriting pass, with schemes for user preference: streaming (one pass of each MODE, in priority order) for an immediately decodable file, or archival (multiple MODE passes, in priority order) to squeeze.


    EDIT:
    Considering the Huffman method (MODE 3) is optimized with a temporary offset file and thorough scans to find exactly the most compressible sequences, with the 'heaviest' sequences getting the fewest bits on the optimized code tree, it should generally be more effective than other codecs that use plain Huffman - it is 'the smallest result' it can be. Still, sections which repeat will produce coded strings which repeat, and general areas could still do with shrinking - hence it is more ideal to handle those beforehand with the other MODES.



    A video like this, which is mostly the same few colours, could have almost its entire contents put into one or two shrunken 'sections' of areas offset on each frame; with the 'in a row' and 'throughout' MODES run on each section (its areas treated as continuous), the total size of the image data could be considerably compressed while remaining accessible and streamable. The audio data will also be compressed where repetition is detected, and the bitrate/resolution/fps can go in the additional-information section of the header. Overall this could be both compressed and accessible - more so than a conventional mpx codec (and lossless).

    Unlike other codecs, this shouldn't output a file in roughly the same size range regardless of content (the way mp3 does for x minutes of audio) - it works on the file based on what it can actually find.